
Embedding Regression: Models for Context-Specific Description and Inference

Published online by Cambridge University Press:  19 January 2023

PEDRO L. RODRIGUEZ, New York University, United States
ARTHUR SPIRLING, New York University, United States
BRANDON M. STEWART, Princeton University, United States

Pedro L. Rodriguez, Visiting Scholar, Center for Data Science, New York University, United States; and International Faculty, Instituto de Estudios Superiores de Administración, Venezuela, pedro.rodriguez@nyu.edu.
Arthur Spirling, Professor of Politics and Data Science, Department of Politics, New York University, United States, arthur.spirling@nyu.edu.
Brandon M. Stewart, Associate Professor, Sociology and Office of Population Research, Princeton University, United States, bms4@princeton.edu.

Abstract

Social scientists commonly seek to make statements about how word use varies over circumstances—including time, partisan identity, or some other document-level covariate. For example, researchers might wish to know how Republicans and Democrats diverge in their understanding of the term “immigration.” Building on the success of pretrained language models, we introduce the à la carte on text (conText) embedding regression model for this purpose. This fast and simple method produces valid vector representations of how words are used—and thus what words “mean”—in different contexts. We show that it outperforms slower, more complicated alternatives and works well even with very few documents. The model also allows for hypothesis testing and statements about statistical significance. We demonstrate that it can be used for a broad range of important tasks, including understanding US polarization, historical legislative development, and sentiment detection. We provide open-source software for fitting the model.
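The core idea the abstract describes — embed each instance of a target word from its surrounding context via the à la carte (ALC) transformation, then regress those instance-level embeddings on document covariates and test the size of the coefficient — can be sketched as follows. This is a minimal illustration with synthetic data, not the authors' conText software: the transformation matrix `A`, the simulated contexts, and the "party" covariate are all hypothetical stand-ins, and the permutation test is a simplified substitute for the paper's inference procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: D-dimensional "pretrained" embeddings and an ALC transform A.
# In practice both come from a large corpus; here they are synthetic.
D = 50
A = rng.normal(size=(D, D)) / np.sqrt(D)  # a la carte transformation matrix

def alc_embed(context_vectors, A):
    """Embed one instance of a target word: average the pretrained
    vectors of its context words, then apply the ALC transform."""
    return A @ context_vectors.mean(axis=0)

# Simulate N instances of a target word, each with 10 context words,
# where a binary covariate (e.g., party) shifts how the word is used.
N = 200
party = rng.integers(0, 2, size=N)       # 0/1 document-level covariate
shift = rng.normal(size=D)               # true difference in usage
Y = np.empty((N, D))
for i in range(N):
    ctx = rng.normal(size=(10, D)) + party[i] * shift
    Y[i] = alc_embed(ctx, A)

# Embedding regression: multivariate OLS of instance embeddings on covariates.
X = np.column_stack([np.ones(N), party])
B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # B is (2, D); B[1] is beta-hat
beta_norm = np.linalg.norm(B[1])           # size of the usage difference

# Permutation test on ||beta-hat||: shuffle the covariate and re-fit.
perm_norms = []
for _ in range(500):
    Xp = np.column_stack([np.ones(N), rng.permutation(party)])
    Bp, *_ = np.linalg.lstsq(Xp, Y, rcond=None)
    perm_norms.append(np.linalg.norm(Bp[1]))
p_value = np.mean(np.array(perm_norms) >= beta_norm)
print(f"||beta|| = {beta_norm:.3f}, permutation p = {p_value:.3f}")
```

Because the simulated covariate genuinely shifts the contexts, the fitted norm of beta-hat is large relative to its permutation distribution, mirroring how the model supports statements about statistical significance.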

Information

Type: Research Article
License: Creative Commons CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of the American Political Science Association
Figure 1. Replication of Figure 3 in Rodman (2020) Adding ALC Results. Note: ALC = ALC model, CHR = chronological model, and GS = gold standard.

Table 1. Nearest Neighbors for the 1855 Corpus

Table 2. Nearest Neighbors for the 2005 Corpus

Figure 2. Identification of Distinct Clusters. Note: Each observation represents a single realization of a target word, either trump or Trump. Misclassified instances are instances of either target word that were assigned the majority cluster of the opposite target word.

Figure 3. Cluster Homogeneity. Note: Cluster homogeneity (in terms of Trump vs. trump) of k-means with two clusters of individual term instances embedded using different methods.

Table 3. Top 10 Nearest Neighbors Using Simple Averaging of Embeddings and ALC

Figure 4. Relative Semantic Shift from "Trump". Note: Values are the norm of $ \hat{\beta} $ with bootstrap confidence intervals. See SM Section J for full regression output. *** = statistically significant at the 0.01 level.

Figure 5. Differences in Word Meaning by Gender and Party. Note: Generally, different genders in the same party have a more similar understanding of a term than the same gender across parties. See SM Section J for full regression output. All coefficients are statistically significant at the 0.01 level.

Table 4. Top 10 Nearest Neighbors for the Target Term "Immigration"

Table 5. Subset of Top Nearest Contexts for the Target Term "Immigration"

Figure 6. Cosine Similarity (LOESS Smoothed) between Various Words and "Immigration" at Each Percentile of NOMINATE Scores. Note: We mark the median Democrat and median Republican to help calibrate the scale. See SM Section J for full regression output.

Figure 7. Norm of the British and American Difference in Understanding of "Empire," 1935–2010. Note: Larger values imply the uses are more different. The dashed lines show the bootstrapped 95% CIs. See SM Section J for full regression output.

Figure 8. UK and US Discussions of "Empire" Diverged after 1950. Note: Most US and UK nearest neighbors pre- and post-estimated breakpoint. * = statistically significant at the 0.01 level.

Figure 9. Conservative Backbenchers Were Unsatisfied with Their Own Government's EU Policy Prior to the Referendum. Note: Each column of the plot is a policy area (with the seed word used to calculate sentiment). Those areas are education (education), health (nhs), and the EU (eu). Note the middle-right plot: rank-and-file Conservative MP sentiment on EU policy is negatively correlated with the leadership's sentiment.

Figure 10. Replication of Figure 9 Using a Dictionary Approach. Note: The sentiment patterns are less obvious.

Supplementary material

Rodriguez et al. Dataset (link)
Rodriguez et al. supplementary material (PDF, 1.6 MB)