Hostname: page-component-848d4c4894-nmvwc Total loading time: 0 Render date: 2024-06-21T19:01:35.216Z Has data issue: false hasContentIssue false

Algebraic data integration*

Published online by Cambridge University Press:  02 November 2017

Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA (e-mail:
Categorical Informatics, Inc., Cambridge, MA, USA (e-mail:
Rights & Permissions [Opens in a new window]


Core share and HTML view are not available for this content. However, as you have access to this content, a full PDF is available via the ‘Save PDF’ action button.

In this paper, we develop an algebraic approach to data integration by combining techniques from functional programming, category theory, and database theory. In our formalism, database schemas and instances are algebraic (multi-sorted equational) theories of a certain form. Schemas denote categories, and instances denote their initial (term) algebras. The instances on a schema S form a category, SInst, and a morphism of schemas F : ST induces three adjoint data migration functors: ΣF : SInstTInst, defined by substitution along F, which has a right adjoint ΔF : TInstSInst, which in turn has a right adjoint ΠF : SInstTInst. We present a query language based on for/where/return syntax where each query denotes a sequence of data migration functors; a pushout-based design pattern for performing data integration using our formalism; and describe the implementation of our formalism in a tool we call AQL (Algebraic Query Language).

Research Article
Copyright © Cambridge University Press 2017 



The authors would like to thank David Spivak and Peter Gates. Patrick Schultz was supported by AFOSR grant FA9550-14-1-0031, ONR grant N000141310260, and NASA grant NNH13ZEA001N. Ryan Wisnesky was supported by NIST SBIR grant 70NANB15H290.


Abiteboul, S., Hull, R. & Vianu, V. (1995) Foundations of Databases. Addison-Wesley-Longman.Google Scholar
Adámek, J., Rosický, J. & Vitale, E. M. (2011) Algebraic Theories. Cambridge Tracts in Mathematics, vol. 184. Cambridge University Press.Google Scholar
Alagic, S. & Bernstein, P. (2001) A model theory for generic schema management. In Proceedings of the 8th International Workshop on Database Programming Languages, 228–246.Google Scholar
Angles, R. & Gutierrez, C. (2008) Survey of graph database models. ACM Comput. Surv., 40 (1).Google Scholar
Baader, F. & Nipkow, T. (1998) Term Rewriting and All That. Cambridge University Press.Google Scholar
Bachmair, L., Dershowitz, N. & Plaisted, D. A. (1989) Completion without failure. Resolution Equ. Algebr. Struct. – Rewriting Tech. 2.Google Scholar
Barr, M. & Wells, C. (1995) Category Theory for Computing Science. Prentice Hall International.Google Scholar
Bertot, Y. & Castéran, P. (2010) Interactive Theorem Proving and Program Development: Coq'art the Calculus of Inductive Constructions. Springer.Google Scholar
Blum, E. K., Ehrig, H. & Parisi-Presicce, F. (1987) Algebraic specification of modules and their basic interconnections. J. Comput. Syst. Sci. 34 (2–3), 293339.Google Scholar
Bush, M. R., Leeming, M. & Walters, R. F. C. (2003) Computing left Kan extensions. J. Symbol. Comput. 35 (2), 107126.Google Scholar
Doan, H., Halevy, A. & Ives, Z. (2012) Principles of Data Integration. Morgan Kaufmann.Google Scholar
Enderton, H. B. (2001) A Mathematical Introduction to Logic. Academic Press.Google Scholar
Fagin, R., Kolaitis, P. G., Miller, R. J. & Popa, L. (2005b) Data exchange: Semantics and query answering. Theoretical Computer Science 336 (1), 89124.Google Scholar
Fagin, R., Kolaitis, P. G., Popa, L. & Tan, W. (2005a) Composing schema mappings: Second-order dependencies to the rescue. ACM Transactions on Database Systems 30 (4), 9941055.Google Scholar
Fleming, M., Gunther, R. & Rosebrugh, R. (2003) A database of categories. J. Symbol. Comput. 35 (2), 127135.Google Scholar
Garcia-Molina, H., Ullman, J. D. & Widom, J. (2008) Database Systems: The Complete Book, 2nd ed. Upper Saddle River, NJ, USA: Prentice Hall.Google Scholar
Ghilardi, S., Lutz, C. & Wolter, F. (2006) Did I damage my ontology? A case for conservative extensions in description logics. In Proceedings of the 10th International Conference on Principles of Knowledge Representation and Reasoning, 187–197.Google Scholar
Goguen, J. A. & Burstall, R. M. (1984) Introducing institutions. In Proceedings of the Carnegie Mellon Workshop on Logic of Programs, 221–256.Google Scholar
Goguen, J. (2004) Information Integration in Institutions [online]. Available at: Google Scholar
Grust, T. (2004) Monad comprehensions: A versatile representation for queries. In The Functional Approach to Data Management: Modeling, Analyzing and Integrating Heterogeneous Data, Gray, P. M. D., Kerschberg, L., King, P. J. H. & Poulovassilis, A. (eds). Berlin, Heidelberg: Springer, 288311.Google Scholar
Haas, L. M., Hernández, M. A., Ho, H., Popa, L. & Roth, M. (2005) Clio grows up: From research prototype to industrial tool. In Proceedings of the International Conference on Management of Data, 805–810.Google Scholar
Johnson, M., Rosebrugh, R. & Wood, R. J. (2002) Entity-Relationship-Attribute designs and sketches. Theory Appl. Categories 10, 94112.Google Scholar
Kapur, D. & Narendran, P. (1985) The Knuth–Bendix completion procedure and Thue systems. SIAM J. Comput. 14 (4), 10521072.Google Scholar
Knuth, D. & Bendix, P. (1970) Simple word problems in universal algebra. In Computational Problems in Abstract Algebra, Leech, J. (ed). Springer.Google Scholar
Lüth, C., Roggenbach, M. & Schröder, L. (2005) CCC – The CASL consistency checker. In Recent Trends in Algebraic Development Techniques, 17th International Workshop, Fiadeiro, J. (ed). Lecture Notes in Computer Science, vol. 3423, Springer, 94105.Google Scholar
Melnik, S., Rahm, E. & Bernstein, P. A. (2003) Rondo: A programming platform for generic model management. In Proceedings of the International Conference on Management of Data, 193–204.Google Scholar
Mitchell, J. C. (1996) Foundations of Programming Languages. MIT Press.Google Scholar
Mossakowski, T., Krumnack, U. & Maibaum, T. (2014) What is a derived signature morphism? In Recent Trends in Algebraic Development Techniques, Lecture Notes in Computer Science, vol 9463, Springer, 90109.Google Scholar
Mossakowski, T., Rabe, F. & Mihai, C. (2017) Canonical selection of colimits, Algebraic Model Management: A Survery. arXiv:1705.09363.Google Scholar
Nelson, G. & Oppen, D. C. (1980) Fast decision procedures based on congruence closure. J. ACM 27 (2), 356364.Google Scholar
Patterson, E. (2017) Knowledge representation in bicategories of relations. arXiv:1706.00526.Google Scholar
Roberson, E. L. & Wyss, C. M. (2004) Optimal tuple merge is NP-complete. Bloomington: Indiana University School of Informatics and Computing.Google Scholar
Schultz, P., Spivak, D. I., Vasilakopoulou, C. & Wisnesky, R. (2017) Algebraic databases. Theory Appl. Categ. 32.Google Scholar
Schultz, P., Spivak, D. I. & Wisnesky, R. (2015) QINL: Query-integrated languages. arXiv: 1511.06459.Google Scholar
Schultz, P., Spivak, D. I. & Wisnesky, R. (2016) Algebraic model management: A survey. In Proceedings of the Workshop on Algebraic Development Techniques.Google Scholar
Shipman, D. W. (1981) The functional data model and the data languages daplex. ACM Trans. database Syst. 6 (1), 140173.Google Scholar
Spivak, D. I. (2012) Functorial data migration. Inform. Comput. 217, 3151.Google Scholar
Spivak, D. I. (2014) Database queries and constraints via lifting problems. Math. Struct. Comput. Sci. 24 (6).Google Scholar
Spivak, D. I. & Wisnesky, R. (2015) Relational foundations for functorial data migration. In Proceedings of the 15th Symposium on Database Programming Languages, 21–28.Google Scholar
Zieliński, B., Maślanka, P. & Sobieski, Ś. (2013) Allegories for database modeling. In Model and Data Engineering: 3rd International Conference, 278–289.Google Scholar
Submit a response


No Discussions have been published for this article.