Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records

  • TED ENAMORADO, BENJAMIN FIFIELD and KOSUKE IMAI

Abstract

Since most social science research relies on multiple data sources, merging data sets is an essential part of researchers’ workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data may contain missing and inaccurate information. These problems are especially severe when merging large-scale administrative records. We develop a fast and scalable algorithm to implement a canonical model of probabilistic record linkage that has many advantages over the deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. An open-source software package is available for implementing the proposed methodology.

Corresponding author

*Ted Enamorado, Ph.D. Candidate, Department of Politics, Princeton University, tede@princeton.edu, http://www.tedenamorado.com.
Benjamin Fifield, Ph.D. Candidate, Department of Politics, Princeton University, bfifield@princeton.edu, http://www.benfifield.com.
Kosuke Imai, Professor, Department of Government and Department of Statistics, Harvard University, imai@harvard.edu, https://imai.fas.harvard.edu.

Footnotes

The proposed methodology is implemented through an open-source R package, fastLink: Fast Probabilistic Record Linkage, which is freely available for download at the Comprehensive R Archive Network (CRAN; https://CRAN.R-project.org/package=fastLink). We thank Bruce Willsie of L2 and Steffen Weiss of YouGov for data and technical assistance, and Jake Bowers, Seth Hill, Johan Lim, Marc Ratkovic, Mauricio Sadinle, five anonymous reviewers, and audiences at the 2017 Annual Meeting of the American Political Science Association, Columbia University (Political Science), Fifth Asian Political Methodology Meeting, Gakusyuin University (Law), Hong Kong University of Science and Technology, the Institute for Quantitative Social Science (IQSS) at Harvard University, the Quantitative Social Science (QSS) colloquium at Princeton University, Universidad de Chile (Economics), Universidad del Desarrollo, Chile (Government), the 2017 Summer Meeting of the Society for Political Methodology, and the Center for Statistics and the Social Sciences (CSSS) at the University of Washington for useful comments and suggestions. Replication materials can be found on Dataverse at: https://doi.org/10.7910/DVN/YGUHTD.

American Political Science Review
  • ISSN: 0003-0554
  • EISSN: 1537-5943