A programming model and foundation for lineage-based distributed computation

PHILIPP HALLER; HEATHER MILLER; NORMEN MÜLLER

doi:10.1017/S0956796818000035


Part of: Big Data Special Collection

Published online by Cambridge University Press: 12 March 2018


PHILIPP HALLER: School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden (e-mail: phaller@kth.se)
HEATHER MILLER: School of Computer and Communication Sciences, EPFL, CH-1015 Lausanne, Switzerland; College of Computer and Information Science, Northeastern University, Boston, MA-02115, USA (e-mail: heather.miller@epfl.ch)
NORMEN MÜLLER: Safeplace, DE-40667 Meerbusch, Germany


Abstract


The most successful systems for “big data” processing have all adopted functional APIs. We present a new programming model, which we call function passing, designed to provide a more principled substrate, or middleware, upon which to build data-centric distributed systems like Spark. A key idea is to build up a persistent functional data structure representing transformations on distributed immutable data by passing well-typed serializable functions over the wire and applying them to this distributed data. Thus, the function passing model can be thought of as a persistent functional data structure that is distributed, where transformations performed on distributed data are stored in its nodes rather than the distributed data itself. One advantage of this model is that failure recovery is simplified by design – data can be recovered by replaying function applications atop immutable data loaded from stable storage. Deferred evaluation is also central to our model; by incorporating deferred evaluation into our design only at the point of initiating network communication, the function passing model remains easy to reason about while remaining efficient in time and memory. Moreover, we provide a complete formalization of the programming model in order to study the foundations of lineage-based distributed computation. In particular, we develop a theory of safe, mobile lineages based on a subject reduction theorem for a typed core language. Furthermore, we formalize a progress theorem that guarantees the finite materialization of remote, lineage-based data. Thus, the formal model may serve as a basis for further developments of the theory of data-centric distributed programming, including aspects such as fault tolerance. We provide an open-source implementation of our model in and for the Scala programming language, along with a case study of several example frameworks and end-user programs written atop this model.

Information

Type: Research Article
Information: Journal of Functional Programming, Volume 28, 2018, e7

DOI: https://doi.org/10.1017/S0956796818000035
Copyright: © Cambridge University Press 2018

