Adaptive Fuzzy String Matching: How to Merge Datasets with Only One (Messy) Identifying Field

Published online by Cambridge University Press:  11 October 2021

Aaron R. Kaufman*
Affiliation:
Division of Social Sciences, New York University Abu Dhabi, Saadiyat Island, Abu Dhabi, United Arab Emirates. Email: AaronRKaufman.com, aaronkaufman@nyu.edu
Aja Klevs
Affiliation:
Center for Data Science, New York University, New York, NY, USA
*Corresponding author: Aaron R. Kaufman

Abstract

A single dataset is rarely sufficient to address a question of substantive interest. Instead, most applied data analysis combines data from multiple sources. Two datasets very rarely contain the same identifiers on which to merge them; fields like name, address, and phone number may be entered incorrectly, missing, or in dissimilar formats. Combining multiple datasets absent a unique identifier that unambiguously connects entries is called the record linkage problem. While recent work has made great progress in the case where there are many possible fields on which to match, the much more uncertain case of only one identifying field remains unsolved: this fuzzy string matching problem, both a problem in its own right and a component of standard record linkage problems, is our focus. We design and validate an algorithmic solution called Adaptive Fuzzy String Matching, rooted in adaptive learning, and show that our tool identifies more matches, with higher precision, than existing solutions. Finally, we illustrate its validity and practical value through applications to matching organizations, places, and individuals.

Information

Type
Letter
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press on behalf of the Society for Political Methodology
Table 1. Each row indicates a possible matched pair of String 1 and String 2, and contains the true match status and four different string-distance metrics. In the first row, String 1 and String 2 are identical, so all distance scores are 0.
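To make the structure of such a table concrete, the following is a minimal sketch of how one row of string-distance scores might be computed for a candidate pair. The table caption does not name its four metrics, so the ones used here (raw and length-normalized Levenshtein distance, Jaccard distance on character bigrams, and one minus the `difflib` similarity ratio) are assumptions chosen for illustration, and the function names are hypothetical. The sketch only illustrates the property the caption notes: identical strings score 0 on every distance.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer string's length, in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def bigram_jaccard_distance(a: str, b: str) -> float:
    """One minus the Jaccard similarity of the two sets of character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    if not union:
        # Neither string has a bigram; fall back to exact comparison.
        return 0.0 if a == b else 1.0
    return 1.0 - len(grams_a & grams_b) / len(union)


def ratio_distance(a: str, b: str) -> float:
    """One minus difflib's similarity ratio, so identical strings score 0."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def distance_row(s1: str, s2: str) -> dict:
    """Hypothetical helper: one table row of distances for a candidate pair."""
    return {
        "levenshtein": levenshtein(s1, s2),
        "normalized_levenshtein": normalized_levenshtein(s1, s2),
        "bigram_jaccard": bigram_jaccard_distance(s1, s2),
        "ratio_distance": ratio_distance(s1, s2),
    }


if __name__ == "__main__":
    # Illustrative strings only; these are not rows from the paper's table.
    print(distance_row("New York University", "New York University"))
    print(distance_row("New York University", "NYU Abu Dhabi"))
```

Using several metrics rather than one reflects the idea that each captures a different kind of discrepancy (typos, token overlap, shared substrings), which is why a table of this kind reports them side by side.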

Figure 1. Match precision increases as the predicted match probability increases for the human-in-the-loop (HITL) model, the baseline model, and four constituent measures. At a confidence of 0.95 or greater, the HITL model achieves 88.0% precision.
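The quantity plotted in a figure of this kind can be sketched as precision among candidate pairs whose predicted match probability meets a confidence threshold. The snippet below is an illustrative computation under that reading; the function name and the toy probabilities and labels are hypothetical and are not the paper's data or code.

```python
from typing import Optional, Sequence


def precision_at_threshold(
    probs: Sequence[float],
    labels: Sequence[int],
    threshold: float,
) -> Optional[float]:
    """Share of true matches among pairs predicted at or above `threshold`.

    `probs` are predicted match probabilities, `labels` are 1 for a true
    match and 0 otherwise. Returns None when no pair clears the threshold.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    kept = [y for p, y in zip(probs, labels) if p >= threshold]
    if not kept:
        return None
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Toy values for illustration only.
    toy_probs = [0.99, 0.97, 0.96, 0.80, 0.40]
    toy_labels = [1, 1, 0, 1, 0]
    for t in (0.5, 0.95):
        print(t, precision_at_threshold(toy_probs, toy_labels, t))
```

Sweeping the threshold from 0 to 1 and recording the result at each step yields a curve of the shape the caption describes, one per model or constituent measure.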

Supplementary material: Link

Kaufman and Klevs Dataset

Supplementary material: PDF

Kaufman and Klevs supplementary material
