Linking datasets on organizations using half a billion open-collaborated records

Published online by Cambridge University Press: 16 October 2024

Brian Libgober
Affiliation:
Department of Political Science and Institute of Policy Research, Northwestern University, Evanston, IL, USA
Connor T. Jerzak*
Affiliation:
Department of Government, The University of Texas at Austin, Austin, TX, USA
* Corresponding author: Connor Jerzak; Email: connor.jerzak@austin.utexas.edu

Abstract

Scholars studying organizations often work with multiple datasets lacking shared identifiers or covariates. In such situations, researchers usually use approximate string (“fuzzy”) matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even where two strings appear similar to humans, fuzzy matching often struggles because it fails to adapt to the informativeness of the character combinations. In response, a number of machine learning methods have been developed to refine string matching. Yet, the effectiveness of these methods is limited by the size and diversity of training data. This paper introduces data from a prominent employment networking site (LinkedIn) as a massive training corpus to address these limitations. By leveraging information from the LinkedIn corpus regarding organizational name-to-name links, we incorporate trillions of name pair examples into various methods to enhance existing matching benchmarks and performance by explicitly maximizing match probabilities. We also show how relationships between organization names can be modeled using a network representation of the LinkedIn data. In illustrative merging tasks involving lobbying firms, we document improvements when using the LinkedIn corpus in matching calibration and make all data and methods open source.
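The limitation of fuzzy matching described in the abstract can be illustrated with a minimal sketch. It assumes Python's standard-library `difflib.SequenceMatcher` as a stand-in for a generic string-similarity measure; this is not the matching method used in the paper, and the name pairs below are hypothetical examples chosen only for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_similarity(name_a: str, name_b: str) -> float:
    """Return a 0-1 character-overlap similarity between two names.

    Assumption: difflib's ratio stands in for a generic fuzzy-matching
    score; the paper's baselines may use other string distances.
    """
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


# Hypothetical name pairs, for illustration only.
lookalike_pair = ("Bank of America", "Bank of Africa")  # distinct organizations
alias_pair = ("JP Morgan", "Chase")                      # closely related names

# The look-alike pair shares most of its characters, so it scores higher
# than the related pair, which shares almost none.
print(fuzzy_similarity(*lookalike_pair))
print(fuzzy_similarity(*alias_pair))
```

In this sketch the two distinct organizations receive a much higher similarity score than the two related names, which is the kind of error that additional information about name-to-name links is meant to correct.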

Information

Type
Original Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of EPS Academic Ltd

Table 1. Illustration of source data using three public figures

Table 2. Descriptive statistics for the LinkedIn data

Figure 1. Checkered flag diagram describing the organizational linkage problem.

Figure 2. A high-level illustration of the multi-level neural network's architecture. We learn from data how to represent, in a vector space, (a) the characters that constitute words and (b) the words that constitute organizational names. Each lower level is used to generate a higher-level representation in vector space via a flexible model, with better lower-level representations learned by tuning higher-order representations.

Figure 3. Visualizing the machine learning output: Similar organizational names are close in this vector space, which has been projected to two dimensions using PCA.

Figure 4. Visualizing the machine learning model output. Left: Fuzzy matching, used as a baseline, generates distances between strings that in some cases have trouble distinguishing matches from non-matches. Right: On average, organizational alias matches have higher match probabilities than non-matches.

Figure 5. LinkedIn Name Network and Community Detection Algorithm. The figure illustrates how the community detection algorithm links organizational aliases for 11 aliases in the banking space. Left: A weighted graph derived from raw counts of connections in the LinkedIn database. Due to some stronger connections between some nodes, the figure suggests some clusters visually, but there are also connections across clusters, which makes clustering a non-trivial problem. Middle: The Markov clustering algorithm proceeds toward convergence. The connections between nodes are now stronger within and weaker across communities. A “Bank of America” community appears to have been detected. Right: Community detection has converged to three clusters. The tight connection between JP Morgan and its subsidiaries contrasts with what is found through word embeddings, where these names are rather distant in the machine learning-based vector space (compare the positioning of some of the same aliases in Figure 3).
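The expansion and inflation steps behind Markov clustering, as described in the caption above, can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names, the self-loop weight, the inflation value, and the toy weight matrix (two tightly linked groups of three aliases joined by one weak edge) are all hypothetical.

```python
def normalize_columns(m):
    """Rescale each column to sum to 1 (a column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (two-step random-walk flow)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def markov_cluster(weights, inflation=2.0, iterations=50, threshold=1e-6):
    """Return clusters (lists of node indices) from a symmetric weight matrix."""
    n = len(weights)
    # Assumption: a self-loop of weight 1 is added to every node.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Group nodes that still exchange non-negligible flow (union-find).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > threshold:
                parent[find(i)] = find(j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical connection counts: nodes 0-2 and 3-5 are tightly linked,
# with one weak cross-group edge between nodes 2 and 3.
toy_weights = [
    [0, 5, 5, 0, 0, 0],
    [5, 0, 5, 0, 0, 0],
    [5, 5, 0, 1, 0, 0],
    [0, 0, 1, 0, 5, 5],
    [0, 0, 0, 5, 0, 5],
    [0, 0, 0, 5, 5, 0],
]
print(markov_cluster(toy_weights))
```

Repeated inflation strengthens flow within densely connected groups and weakens the cross-group edge, which mirrors the progression from the left panel to the right panel of the figure.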

Figure 6. Checkered flag diagrams illustrating a unified approach to name record linkage using the LinkedIn corpus: (a) direct linkage through machine learning-optimized string matching; (b) indirect linkage through a directory constructed using community detection.

Figure 7. We find that dataset linkage using any one of the approaches with the LinkedIn network obtains favorable performance relative to fuzzy string matching, both when examining only the raw percentage of correct matches obtained (Left Panel) and when adjusting for the rate of false positives and false negatives in the F2 score (Right Panel). In both panels, higher values along the Y-axis are better. “Bipartite” refers to the bipartite network-based approaches to linkage. “ML” refers to the machine learning approach introduced above. “Fuzzy,” “DeezyMatch,” and “Lookup” refer to the string distance, machine learning, and network baselines, respectively. “Bipartite-ML” refers to the ensemble of “Bipartite” and “ML.” See Figure A.VII.1 for full results with Markov approaches included.
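For readers unfamiliar with the F2 score, the sketch below computes it from counts of true positives, false positives, and false negatives. It assumes the standard F-beta definition with beta = 2, which weights recall more heavily than precision; the function name and the example counts are hypothetical and are not taken from the paper's results.

```python
def f_beta(true_pos: int, false_pos: int, false_neg: int, beta: float = 2.0) -> float:
    """Standard F-beta score; beta=2 gives the F2 score.

    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R),
    where P is precision and R is recall.
    """
    if true_pos == 0:
        return 0.0  # no correct matches: precision and recall are both zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical linkage outcome: 6 correct matches, 6 wrong matches, 2 missed.
print(f_beta(6, 6, 2))             # F2, emphasizing recall
print(f_beta(6, 6, 2, beta=1.0))   # F1, for comparison
```

Because recall exceeds precision in this hypothetical outcome, the F2 score comes out higher than the F1 score for the same counts.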

Table 3. Runtime on the meetings data analysis

Figure 8. The coefficient of log(Assets) for predicting log(1+Expenditures) using the ground truth data is about 2.5 (depicted by a bold gray line; 95 percent confidence interval is displayed using dotted gray lines). At its best point, fuzzy matching underestimates this quantity by about half. The LinkedIn-based matching algorithms recover the coefficient better. See Figure A.VII.3 for full results with Markov approaches included.

Figure 9. In this YCombinator example, we see that the network-based approaches offer no relative benefit in terms of true positives when adjusting for false positives, yet the machine learning approach that uses the LinkedIn corpus performs well over fuzzy matching. Higher values along the Y-axis are better. See Figure A.VII.2 for full results with Markov approaches included.

Table 4. Comparing different approaches to organizational record linkage

Supplementary material: File

Libgober and Jerzak supplementary material (File, 687.9 KB)