Scaling-up reasoning and advanced analytics on BigData

TYSON CONDIE, ARIYAM DAS, MATTEO INTERLANDI, ALEXANDER SHKAPSKY, MOHAN YANG and CARLO ZANIOLO...

Abstract

BigDatalog is an extension of Datalog that achieves performance and scalability on both Apache Spark and multicore systems, to the point that its graph analytics outperform those written in GraphX. Looking back, we see how this realizes the ambitious goal that deductive database researchers began pursuing 40 years ago: combining the rigor and power of logic in expressing queries and reasoning with the performance and scalability by which relational databases managed BigData. This goal led to Datalog, which is based on Horn clauses like Prolog but employs implementation techniques, such as semi-naïve fixpoint and magic sets, that extend the bottom-up computation model of relational systems and thus obtain the performance and scalability that relational systems had achieved, as far back as the 80s, using data parallelization on shared-nothing architectures. But this goal proved difficult to achieve because of major issues at (i) the language level and (ii) the system level. The paper describes how (i) was addressed by simple rules under which the fixpoint semantics extends to programs using count, sum and extrema in recursion, and (ii) was tamed by parallel compilation techniques that achieve scalability on multicore systems and Apache Spark. This paper is under consideration for acceptance in Theory and Practice of Logic Programming.
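
As a rough illustration of two ideas named in the abstract, the sketch below shows a semi-naïve bottom-up fixpoint for transitive closure and a fixpoint that applies min inside the recursion for shortest paths. It is a minimal single-threaded Python sketch, not BigDatalog's implementation: the function names, the rule notation in the docstrings, and the in-memory set and dictionary representation are all assumptions made for illustration, and the shortest-path version assumes the graph has no negative-weight cycles.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Semi-naive evaluation of (illustrative notation, not BigDatalog syntax):
         tc(X, Y) <- edge(X, Y).
         tc(X, Y) <- tc(X, Z), edge(Z, Y).
    """
    succ = defaultdict(set)
    for x, y in edges:
        succ[x].add(y)
    total = set(edges)  # every fact derived so far
    delta = set(edges)  # facts that were new in the previous iteration
    while delta:
        # Join only the new facts with edge, instead of the whole relation.
        derived = {(x, y) for (x, z) in delta for y in succ[z]}
        delta = derived - total
        total |= delta
    return total


def shortest_paths(weighted_edges):
    """Fixpoint with min applied inside the recursion (illustrative notation):
         sp(X, Y, min<D>) <- edge(X, Y, D).
         sp(X, Y, min<D>) <- sp(X, Z, D1), edge(Z, Y, D2), D = D1 + D2.
    Assumes no negative-weight cycles, so the iteration terminates.
    """
    succ = defaultdict(list)
    for x, y, d in weighted_edges:
        succ[x].append((y, d))
    best = {}   # current minimum distance per (source, target) pair
    delta = {}  # pairs whose minimum improved in the previous iteration
    for x, y, d in weighted_edges:
        if d < best.get((x, y), float("inf")):
            best[(x, y)] = d
            delta[(x, y)] = d
    while delta:
        new_delta = {}
        for (x, z), d1 in delta.items():
            for y, d2 in succ[z]:
                d = d1 + d2
                # Keep a derived fact only if it improves the current minimum.
                if d < best.get((x, y), float("inf")):
                    best[(x, y)] = d
                    new_delta[(x, y)] = d
        delta = new_delta
    return best


if __name__ == "__main__":
    print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
    print(shortest_paths([("a", "b", 1), ("b", "c", 2), ("a", "c", 5)]))
```

In both functions only the facts derived in the previous iteration are joined again, which is the essence of the semi-naïve strategy; in the second, non-minimal distances are discarded during the iteration rather than after it.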

Footnotes

*This work was supported in part by NSF under Grants IIS-1218471, IIS-1302698 and CNS-1351047, and in part by NIH BigData to Knowledge (BD2K) under Grant U54EB020404.
