Hostname: page-component-8448b6f56d-42gr6 Total loading time: 0 Render date: 2024-04-19T09:40:20.027Z Has data issue: false hasContentIssue false

An algebra for distributed Big Data analytics

Published online by Cambridge University Press:  11 December 2017

LEONIDAS FEGARAS*
Affiliation:
Department of Computer Science and Engineering, University of Texas at Arlington, Arlington TX 76019, USA (e-mail: fegaras@cse.uta.edu)
Rights & Permissions [Opens in a new window]

Abstract

Core share and HTML view are not available for this content. However, as you have access to this content, a full PDF is available via the ‘Save PDF’ action button.

We present an algebra for data-intensive scalable computing based on monoid homomorphisms that consists of a small set of operations that capture most features supported by current domain-specific languages for data-centric distributed computing. This algebra is being used as the formal basis of MRQL, which is a query processing and optimization system for large-scale distributed data analysis. The MRQL semantics is given in terms of monoid comprehensions, which support group-by and order-by syntax and can work on heterogeneous collections without requiring any extension to the monoid algebra. We present the syntax and semantics of monoid comprehensions and provide rules to translate them to the monoid algebra. We give evidence of the effectiveness of our algebra by presenting some important optimization rules, such as converting nested queries to joins.

Type
Research Article
Copyright
Copyright © Cambridge University Press 2017 

References

Acar, U. A., Blelloch, G. E., Blume, M., Harper, R. & Tangwongsan, K. (2009) An experimental analysis of self-adjusting computation. ACM Trans. Program. Lang. Syst. (TOPLAS) 32 (1), 3:1–53.Google Scholar
Aït-Kaci, H. (2013) An abstract, reusable & extensible programming language design architecture. In In Search of Elegance in the Theory and Practice of Computation, Springer 2013, LNCS 8000, pp. 112–166. Available at http://hassan-ait-kaci.net/pdf/hak-opb.pdf.Google Scholar
Alexandrov, A., Katsifodimos, A., Krastev, G. & Markl, V. (2016) Implicit parallelism through deep language embedding. SIGMOD Record 45 (1), 5158.Google Scholar
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A. & Zaharia, M. (2015) Spark SQL: Relational data processing in Spark. In International Conference on Management of Data (SIGMOD). pp. 1383–1394.Google Scholar
Apache Flink. (2017) Available at: http://flink.apache.org/, accessed January 2, 2017.Google Scholar
Apache Giraph. (2017) Available at: http://giraph.apache.org/, accessed January 2, 2017.Google Scholar
Apache Hadoop. (2017) Available at: http://hadoop.apache.org/, accessed January 2, 2017.Google Scholar
Apache Hama. (2017) Available at: http://hama.apache.org/, accessed January 2, 2017.Google Scholar
Apache Hive. (2017) Available at: http://hive.apache.org/, accessed January 2, 2017.Google Scholar
Apache MRQL (incubating). (2017) Available at: http://mrql.incubator.apache.org/ The MRQL syntax is described at http://wiki.apache.org/mrql/LanguageDescription, accessed January 2, 2017.Google Scholar
Apache Spark. (2017) Available at: http://spark.apache.org/, accessed January 2, 2017.Google Scholar
Backhouse, R. & Hoogendijk, P. (1993) Elements of a relational theory of datatypes. In Formal Program Development, IFIP TC2/WG 2.1 State of the Art Seminar, Springer-Verlag 1993, LNCS, vol. 755, pp. 742.Google Scholar
Bancilhon, F., Briggs, T., Khoshafian, S. & Valduriez, P. (1987) FAD, a powerful and simple database language. In Proceedings of the International Conference on Very Large Data Bases. pp. 97–105.Google Scholar
Battre, D., Ewen, S., Hueske, F., Kao, O., Markl, V. & Warneke, D. (2010) Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In 1st ACM Symposium on Cloud Computing (SOCC'10). pp. 119–130.Google Scholar
Blelloch, G. (1993) NESL: A nested data-parallel language. Technical report, Carnegie Mellon University. CMU-CS-93-129.Google Scholar
Blelloch, G. & Sabot, G. (1990) Compiling collection-oriented languages onto massively parallel computers. J. Parallel Distrib. Comput. 8 (2), 119134.Google Scholar
Boykin, O., Ritchie, S., O'Connell, I. & Lin, J. (2014) Summingbird: A framework for integrating batch and online MapReduce computations. Proc. VLDB Endowment (PVLDB) 7 (13), 14411451.Google Scholar
Bryant, R. E. (2011) Data-intensive scalable computing for scientific applications. Comput. Sci. Eng. 13 (6), 2533.Google Scholar
Buneman, P., Libkin, L., Suciu, D., Tannen, V. & Wong, L. (1994) Comprehension syntax. SIGMOD Record 23 (11), 8796.Google Scholar
Chaiken, R., Jenkins, B., Larson, P.-A., Ramsey, B., Shakib, D., Weaver, S. & Zhou, J. (2008) SCOPE: Easy and efficient parallel processing of massive data Sets. Proc. VLDB Endowment (PVLDB) 1 (2), 12651276.Google Scholar
Chakrabarti, D., Zhan, Y. & Faloutsos, C. (2004) R-MAT: A recursive model for graph mining. In SIAM International Conference on Data Mining (SDM). pp. 442–446.Google Scholar
Dean, J. & Ghemawat, S. (2004) MapReduce: Simplified data processing on large clusters. In Symposium on Operating System Design and Implementation (OSDI).Google Scholar
Fegaras, L. (2012) Supporting bulk synchronous parallelism in map-reduce queries. In International Workshop on Data Intensive Computing in the Clouds (DataCloud).Google Scholar
Fegaras, L. (2016) Incremental query processing on big data streams. IEEE Trans. Knowl. Data Eng. 28 (11), 29983012. Available at: https://lambda.uta.edu/tkde16-preprint.pdf.Google Scholar
Fegaras, L., Li, C., Gupta, U. & Philip, J. J. (2011) XML query optimization in map-reduce. In International Workshop on the Web and Databases (WebDB).Google Scholar
Fegaras, L., Li, C. & Gupta, U. (2012) An optimization framework for map-reduce queries. In International Conference on Extending Database Technology (EDBT). pp. 26–37.Google Scholar
Fegaras, L. & Maier, D. (1995) Towards an effective calculus for object query languages. In International Conference on Management of Data (SIGMOD). pp. 47–58.Google Scholar
Fegaras, L. & Maier, D. (2000) Optimizing object queries using an effective calculus. ACM Trans. Database Syst. (TODS) 25 (4), 457516. Available at: https://lambda.uta.edu/tods00.pdf.Google Scholar
Gates, A. F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S. M., Olston, C., Reed, B., Srinivasan, S. & Srivastava, U. (2009) Building a high-level dataflow system on top of map-reduce: the pig Experience. Proc. VLDB Endowment (PVLDB) 2 (2), 14141425.Google Scholar
Gibbons, J. (1996) The third homomorphism theorem. J. Funct. Program. 6 (4), 657665.Google Scholar
Gibbons, J. (2016) Comprehending ringads: For Phil Wadler on the occasion of his 60th birthday. In A List of Successes That Can Change the World. Springer 2016, LNCS, vol. 9600, pp. 132151.Google Scholar
Giorgidze, G., Grust, T., Schweinsberg, N. & Weijers, J. (2011) Bringing back monad comprehensions. In Haskell Symposium, pp. 13–22.Google Scholar
Grust, T. & Scholl, M. H. (1999) How to comprehend queries functionally. J. Intell. Inform. Syst. 12 (2–3), 191218.Google Scholar
Holsch, J., Grossniklaus, M. & Scholl, M. H. (2016) Optimization of nested queries using the NF2 algebra. In ACM SIGMOD International Conference on Management of Data. pp. 1765–1780.Google Scholar
Isard, M. & Yu, Y. (2009) Distributed data-parallel computing using a high-level programming language. In ACM SIGMOD International Conference on Management of Data. pp. 987–994.Google Scholar
Lin, J. & Dyer, C. (2010) Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers.Google Scholar
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C. & Hellerstein, J. M. (2012) Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endowment (PVLDB) 5 (8), 716727.Google Scholar
Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N. & Czajkowski, G. (2010) Pregel: A system for large-scale graph processing. In ACM SIGMOD International Conference on Management of Data. pp. 135–146.Google Scholar
Olston, C., Reed, B., Srivastava, U., Kumar, R. & Tomkins, A. (2008) Pig Latin: A not-so-foreign language for data processing. In ACM SIGMOD International Conference on Management of Data. pp. 1099–1110.Google Scholar
Power, R. & Li, J. (2010) Piccolo: Building fast, distributed programs with partitioned tables. In Symposium on Operating System Design and Implementation (OSDI).Google Scholar
Shinnar, A., Cunningham, D., Herta, B. & Saraswat, V. (2012) M3R: Increased performance for in-memory hadoop jobs. Proc. VLDB Endowment (PVLDB) 5 (12), 17361747.Google Scholar
Steele, G. L. Jr. (2009) Organizing functional code for parallel execution or, foldl and foldr considered slightly harmful. In ICFP. pp. 1–2.Google Scholar
Tannen, V. B., Buneman, P. & Naqvi, S. (1991) Structural recursion as a query language. In International Workshop on Database Programming Languages: Bulk Types and Persistent Data (DBPL). pp. 9–19.Google Scholar
Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Antony, S., Liu, H., Wyckoff, P. & Murthy, R. (2009) Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endowment (PVLDB) 2 (2), 16261629.Google Scholar
Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H. & Murthy, R. (2010) Hive: A petabyte scale data warehouse using hadoop. In IEEE International Conference on Data Engineering (ICDE). pp. 996–1005.Google Scholar
Trinder, P. & Wadler, P. (1989) Improving list comprehension database queries. In TENCON. pp. 186–192.Google Scholar
Trinder, P. W. (1991) Comprehensions, a query notation for DBPLs. In International Workshop on Database Programming Languages (DBPL). pp. 55–68.Google Scholar
Valiant, L. G. (1990) A bridging model for parallel computation. Commun. ACM (CACM) 33 (8), 103111.Google Scholar
Wadler, P. (1990) Comprehending monads. In ACM Symposium on Lisp and Functional Programming. pp. 61–78.Google Scholar
Wadler, P. (1987) List comprehensions. In The Implementation of Functional Programming Languages, Peyton Jones, S. (ed.). Prentice Hall, Chapter 7.Google Scholar
Wadler, P. & Peyton Jones, S. (2007) Comprehensive comprehensions (comprehensions with ‘Order by’ and ‘Group by’). In Haskell Symposium. pp. 61–72.Google Scholar
Wong, L. (2000) Kleisli, a functional query system. J. Funct. Program. 10 (1), 1956.Google Scholar
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S. & Stoica, I. (2012) Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX Symposium on Networked Systems Design and Implementation (NSDI).Google Scholar
Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S. & Stoica, I. (2013) Discretized streams: Fault-tolerant streaming computation at scale. In Symposium on Operating Systems Principles (SOSP).Google Scholar
Submit a response

Discussions

No Discussions have been published for this article.