Skip to main content
×
×
Home

An algebra for distributed Big Data analytics

  • LEONIDAS FEGARAS (a1)
Abstract

We present an algebra for data-intensive scalable computing based on monoid homomorphisms that consists of a small set of operations that capture most features supported by current domain-specific languages for data-centric distributed computing. This algebra is being used as the formal basis of MRQL, which is a query processing and optimization system for large-scale distributed data analysis. The MRQL semantics is given in terms of monoid comprehensions, which support group-by and order-by syntax and can work on heterogeneous collections without requiring any extension to the monoid algebra. We present the syntax and semantics of monoid comprehensions and provide rules to translate them to the monoid algebra. We give evidence of the effectiveness of our algebra by presenting some important optimization rules, such as converting nested queries to joins.

Copyright
References
Hide All
Acar, U. A., Blelloch, G. E., Blume, M., Harper, R. & Tangwongsan, K. (2009) An experimental analysis of self-adjusting computation. ACM Trans. Program. Lang. Syst. (TOPLAS) 32 (1), 3:1–53.
Aït-Kaci, H. (2013) An abstract, reusable & extensible programming language design architecture. In In Search of Elegance in the Theory and Practice of Computation, Springer 2013, LNCS 8000, pp. 112–166. Available at http://hassan-ait-kaci.net/pdf/hak-opb.pdf.
Alexandrov, A., Katsifodimos, A., Krastev, G. & Markl, V. (2016) Implicit parallelism through deep language embedding. SIGMOD Record 45 (1), 5158.
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A. & Zaharia, M. (2015) Spark SQL: Relational data processing in Spark. In International Conference on Management of Data (SIGMOD). pp. 1383–1394.
Apache Flink. (2017) Available at: http://flink.apache.org/, accessed January 2, 2017.
Apache Giraph. (2017) Available at: http://giraph.apache.org/, accessed January 2, 2017.
Apache Hadoop. (2017) Available at: http://hadoop.apache.org/, accessed January 2, 2017.
Apache Hama. (2017) Available at: http://hama.apache.org/, accessed January 2, 2017.
Apache Hive. (2017) Available at: http://hive.apache.org/, accessed January 2, 2017.
Apache MRQL (incubating). (2017) Available at: http://mrql.incubator.apache.org/ The MRQL syntax is described at http://wiki.apache.org/mrql/LanguageDescription, accessed January 2, 2017.
Apache Spark. (2017) Available at: http://spark.apache.org/, accessed January 2, 2017.
Backhouse, R. & Hoogendijk, P. (1993) Elements of a relational theory of datatypes. In Formal Program Development, IFIP TC2/WG 2.1 State of the Art Seminar, Springer-Verlag 1993, LNCS, vol. 755, pp. 742.
Bancilhon, F., Briggs, T., Khoshafian, S. & Valduriez, P. (1987) FAD, a powerful and simple database language. In Proceedings of the International Conference on Very Large Data Bases. pp. 97–105.
Battre, D., Ewen, S., Hueske, F., Kao, O., Markl, V. & Warneke, D. (2010) Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In 1st ACM Symposium on Cloud Computing (SOCC'10). pp. 119–130.
Blelloch, G. (1993) NESL: A nested data-parallel language. Technical report, Carnegie Mellon University. CMU-CS-93-129.
Blelloch, G. & Sabot, G. (1990) Compiling collection-oriented languages onto massively parallel computers. J. Parallel Distrib. Comput. 8 (2), 119134.
Boykin, O., Ritchie, S., O'Connell, I. & Lin, J. (2014) Summingbird: A framework for integrating batch and online MapReduce computations. Proc. VLDB Endowment (PVLDB) 7 (13), 14411451.
Bryant, R. E. (2011) Data-intensive scalable computing for scientific applications. Comput. Sci. Eng. 13 (6), 2533.
Buneman, P., Libkin, L., Suciu, D., Tannen, V. & Wong, L. (1994) Comprehension syntax. SIGMOD Record 23 (11), 8796.
Chaiken, R., Jenkins, B., Larson, P.-A., Ramsey, B., Shakib, D., Weaver, S. & Zhou, J. (2008) SCOPE: Easy and efficient parallel processing of massive data Sets. Proc. VLDB Endowment (PVLDB) 1 (2), 12651276.
Chakrabarti, D., Zhan, Y. & Faloutsos, C. (2004) R-MAT: A recursive model for graph mining. In SIAM International Conference on Data Mining (SDM). pp. 442–446.
Dean, J. & Ghemawat, S. (2004) MapReduce: Simplified data processing on large clusters. In Symposium on Operating System Design and Implementation (OSDI).
Fegaras, L. (2012) Supporting bulk synchronous parallelism in map-reduce queries. In International Workshop on Data Intensive Computing in the Clouds (DataCloud).
Fegaras, L. (2016) Incremental query processing on big data streams. IEEE Trans. Knowl. Data Eng. 28 (11), 29983012. Available at: https://lambda.uta.edu/tkde16-preprint.pdf.
Fegaras, L., Li, C., Gupta, U. & Philip, J. J. (2011) XML query optimization in map-reduce. In International Workshop on the Web and Databases (WebDB).
Fegaras, L., Li, C. & Gupta, U. (2012) An optimization framework for map-reduce queries. In International Conference on Extending Database Technology (EDBT). pp. 26–37.
Fegaras, L. & Maier, D. (1995) Towards an effective calculus for object query languages. In International Conference on Management of Data (SIGMOD). pp. 47–58.
Fegaras, L. & Maier, D. (2000) Optimizing object queries using an effective calculus. ACM Trans. Database Syst. (TODS) 25 (4), 457516. Available at: https://lambda.uta.edu/tods00.pdf.
Gates, A. F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S. M., Olston, C., Reed, B., Srinivasan, S. & Srivastava, U. (2009) Building a high-level dataflow system on top of map-reduce: the pig Experience. Proc. VLDB Endowment (PVLDB) 2 (2), 14141425.
Gibbons, J. (1996) The third homomorphism theorem. J. Funct. Program. 6 (4), 657665.
Gibbons, J. (2016) Comprehending ringads: For Phil Wadler on the occasion of his 60th birthday. In A List of Successes That Can Change the World. Springer 2016, LNCS, vol. 9600, pp. 132151.
Giorgidze, G., Grust, T., Schweinsberg, N. & Weijers, J. (2011) Bringing back monad comprehensions. In Haskell Symposium, pp. 13–22.
Grust, T. & Scholl, M. H. (1999) How to comprehend queries functionally. J. Intell. Inform. Syst. 12 (2–3), 191218.
Holsch, J., Grossniklaus, M. & Scholl, M. H. (2016) Optimization of nested queries using the NF2 algebra. In ACM SIGMOD International Conference on Management of Data. pp. 1765–1780.
Isard, M. & Yu, Y. (2009) Distributed data-parallel computing using a high-level programming language. In ACM SIGMOD International Conference on Management of Data. pp. 987–994.
Lin, J. & Dyer, C. (2010) Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers.
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C. & Hellerstein, J. M. (2012) Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endowment (PVLDB) 5 (8), 716727.
Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N. & Czajkowski, G. (2010) Pregel: A system for large-scale graph processing. In ACM SIGMOD International Conference on Management of Data. pp. 135–146.
Olston, C., Reed, B., Srivastava, U., Kumar, R. & Tomkins, A. (2008) Pig Latin: A not-so-foreign language for data processing. In ACM SIGMOD International Conference on Management of Data. pp. 1099–1110.
Power, R. & Li, J. (2010) Piccolo: Building fast, distributed programs with partitioned tables. In Symposium on Operating System Design and Implementation (OSDI).
Shinnar, A., Cunningham, D., Herta, B. & Saraswat, V. (2012) M3R: Increased performance for in-memory hadoop jobs. Proc. VLDB Endowment (PVLDB) 5 (12), 17361747.
Steele, G. L. Jr. (2009) Organizing functional code for parallel execution or, foldl and foldr considered slightly harmful. In ICFP. pp. 1–2.
Tannen, V. B., Buneman, P. & Naqvi, S. (1991) Structural recursion as a query language. In International Workshop on Database Programming Languages: Bulk Types and Persistent Data (DBPL). pp. 9–19.
Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Antony, S., Liu, H., Wyckoff, P. & Murthy, R. (2009) Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endowment (PVLDB) 2 (2), 16261629.
Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H. & Murthy, R. (2010) Hive: A petabyte scale data warehouse using hadoop. In IEEE International Conference on Data Engineering (ICDE). pp. 996–1005.
Trinder, P. & Wadler, P. (1989) Improving list comprehension database queries. In TENCON. pp. 186–192.
Trinder, P. W. (1991) Comprehensions, a query notation for DBPLs. In International Workshop on Database Programming Languages (DBPL). pp. 55–68.
Valiant, L. G. (1990) A bridging model for parallel computation. Commun. ACM (CACM) 33 (8), 103111.
Wadler, P. (1990) Comprehending monads. In ACM Symposium on Lisp and Functional Programming. pp. 61–78.
Wadler, P. (1987) List comprehensions. In The Implementation of Functional Programming Languages, Peyton Jones, S. (ed.). Prentice Hall, Chapter 7.
Wadler, P. & Peyton Jones, S. (2007) Comprehensive comprehensions (comprehensions with ‘Order by’ and ‘Group by’). In Haskell Symposium. pp. 61–72.
Wong, L. (2000) Kleisli, a functional query system. J. Funct. Program. 10 (1), 1956.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S. & Stoica, I. (2012) Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX Symposium on Networked Systems Design and Implementation (NSDI).
Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S. & Stoica, I. (2013) Discretized streams: Fault-tolerant streaming computation at scale. In Symposium on Operating Systems Principles (SOSP).
Recommend this journal

Email your librarian or administrator to recommend adding this journal to your organisation's collection.

Journal of Functional Programming
  • ISSN: 0956-7968
  • EISSN: 1469-7653
  • URL: /core/journals/journal-of-functional-programming
Please enter your name
Please enter a valid email address
Who would you like to send this to? *
×

Metrics

Altmetric attention score

Full text views

Total number of HTML views: 0
Total number of PDF views: 50 *
Loading metrics...

Abstract views

Total abstract views: 747 *
Loading metrics...

* Views captured on Cambridge Core between 11th December 2017 - 19th June 2018. This data will be updated every 24 hours.

An algebra for distributed Big Data analytics

  • LEONIDAS FEGARAS (a1)
Submit a response

Discussions

No Discussions have been published for this article.

×

Reply to: Submit a response


Your details


Conflicting interests

Do you have any conflicting interests? *