Computation semantics of the functional scientific workflow language Cuneiform*

  • JÖRGEN BRANDT (a1), WOLFGANG REISIG (a1) and ULF LESER (a1)
Abstract

Cuneiform is a minimal functional programming language for large-scale scientific data analysis. Implementing a strict black-box view on external operators and data, it allows the direct embedding of code in a variety of external languages like Python or R, provides data-parallel higher-order operators for processing large partitioned data sets, allows conditionals and general recursion, and has a naturally parallelizable evaluation strategy suitable for multi-core servers and distributed execution environments like Hadoop, HTCondor, or distributed Erlang. Cuneiform has been applied in several data-intensive research areas including remote sensing, machine learning, and bioinformatics, all of which critically depend on the flexible assembly of pre-existing tools and libraries written in different languages into complex pipelines. This paper introduces the computation semantics for Cuneiform. It presents Cuneiform's abstract syntax, a simple type system, and the semantics of evaluation. Providing an unambiguous specification of the behavior of Cuneiform eases the implementation of interpreters, which we showcase by providing a concise reference implementation in Erlang. The similarity of Cuneiform's syntax to the simply typed lambda calculus puts Cuneiform in perspective and allows a straightforward discussion of its design in the context of functional programming. Moreover, the simple type system allows the deduction of the language's safety up to black-box operators. Finally, the formulation of the semantics also permits the verification of compilers to and from other workflow languages.

Footnotes
* This work is funded by the EU FP7 project “Scalable, Secure Storage and Analysis of Biobank Data” under Grant Agreement no. 317871. We also acknowledge funding by the Humboldt Graduate School GRK 1651: SOAMED.

Journal of Functional Programming
  • ISSN: 0956-7968
  • EISSN: 1469-7653
  • URL: /core/journals/journal-of-functional-programming