A programming model and foundation for lineage-based distributed computation

Published online by Cambridge University Press:  12 March 2018

PHILIPP HALLER
Affiliation:
School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden (e-mail: phaller@kth.se)
HEATHER MILLER
Affiliation:
School of Computer and Communication Sciences, EPFL, CH-1015 Lausanne, Switzerland; College of Computer and Information Science, Northeastern University, Boston, MA-02115, USA (e-mail: heather.miller@epfl.ch)
NORMEN MÜLLER
Affiliation:
Safeplace, DE-40667 Meerbusch, Germany

Abstract

The most successful systems for “big data” processing have all adopted functional APIs. We present a new programming model, which we call function passing, designed to provide a more principled substrate, or middleware, upon which to build data-centric distributed systems like Spark. A key idea is to build up a persistent functional data structure representing transformations on distributed immutable data by passing well-typed serializable functions over the wire and applying them to this distributed data. Thus, the function passing model can be thought of as a distributed persistent functional data structure whose nodes store the transformations performed on distributed data rather than the data itself. One advantage of this model is that failure recovery is simplified by design: data can be recovered by replaying function applications atop immutable data loaded from stable storage. Deferred evaluation is also central to our model; by introducing deferred evaluation only at the point of initiating network communication, the function passing model remains easy to reason about while staying efficient in time and memory. Moreover, we provide a complete formalization of the programming model in order to study the foundations of lineage-based distributed computation. In particular, we develop a theory of safe, mobile lineages based on a subject reduction theorem for a typed core language. Furthermore, we formalize a progress theorem that guarantees the finite materialization of remote, lineage-based data. Thus, the formal model may serve as a basis for further developments of the theory of data-centric distributed programming, including aspects such as fault tolerance. We provide an open-source implementation of our model in and for the Scala programming language, along with a case study of several example frameworks and end-user programs written atop this model.
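
The lineage idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the API of the authors' implementation: the names `Lineage`, `Source`, `Mapped`, and `materialize` are hypothetical, and networking, serialization, and typing of mobile functions are left out entirely. The sketch only shows that each node stores a transformation rather than data, that nothing runs until materialization is requested, and that a result can be recomputed by replaying the stored functions over data reloaded from its source.

```scala
// Illustrative sketch only; all names here are hypothetical assumptions.
object LineageSketch {
  // A lineage node records how to obtain a value, not the value itself.
  sealed trait Lineage[A] {
    // Building a new node is cheap: the function is stored, not applied.
    def map[B](f: A => B): Lineage[B] = Mapped(this, f)
    // Evaluation is deferred until materialize() is called.
    def materialize(): A
  }

  // Leaf node: immutable data loaded from (assumed) stable storage.
  final case class Source[A](load: () => A) extends Lineage[A] {
    def materialize(): A = load()
  }

  // Inner node: a transformation applied to the parent's data.
  final case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
    // Replays the parent's lineage, then applies the stored function.
    def materialize(): B = f(parent.materialize())
  }

  def main(args: Array[String]): Unit = {
    val source = Source(() => Vector(1, 2, 3))
    val total = source.map(_.map(_ * 2)).map(_.sum)
    println(total.materialize()) // prints 12
  }
}
```

Because every `materialize()` call in this sketch re-runs the whole chain from `Source`, a lost result is recovered simply by calling it again, which mirrors the replay-based recovery the abstract describes. A real system would presumably cache materialized data and replay only after a failure.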

Type
Research Article
Copyright
Copyright © Cambridge University Press 2018 
