
Probabilistic Record Linkage Using Pretrained Text Embeddings

Published online by Cambridge University Press: 28 August 2025

Joseph T. Ornstein*
Affiliation:
Assistant Professor, Department of Political Science, University of Georgia, Athens, GA, USA

Abstract

Pretrained text embeddings are a fast and scalable method for determining whether two texts have similar meaning, capturing not only lexical similarity, but semantic similarity as well. In this article, I show how to incorporate these measures into a probabilistic record linkage procedure that yields considerable improvements in both precision and recall over existing methods. The procedure even allows researchers to link datasets across different languages. I validate the approach with a series of political science applications, and provide open-source statistical software for researchers to efficiently implement the proposed method.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Political Methodology

Table 1 Examples where lexical similarity is a misleading measure of match quality.

Table 2 Performance metrics for city name merge across three algorithms.

Table 3 Multilingual record linkage application: splitting the Parlgov data into two sets with 4,972 observations each.

Figure 1 Estimated seat-weighted parliamentary ideology following merge (points) plotted over true values (lines).

Supplementary material: File

Ornstein supplementary material

File 2.7 MB
Supplementary material: Link

Ornstein Dataset
