
Multilanguage Word Embeddings for Social Scientists: Estimation, Inference, and Validation Resources for 157 Languages

Published online by Cambridge University Press:  27 December 2024

Elisa M. Wirsching*
Affiliation:
Postdoctoral Fellow, Center for the Study of Democratic Politics, Princeton University, Princeton, NJ, USA
Pedro L. Rodriguez
Affiliation:
Visiting Scholar, Center for Data Science, New York University, New York, NY, USA and International Faculty, Instituto de Estudios Superiores de Administración, Caracas, Venezuela
Arthur Spirling
Affiliation:
Professor, Department of Politics, Princeton University, Princeton, NJ, USA
Brandon M. Stewart
Affiliation:
Associate Professor, Sociology and Office of Population Research, Princeton University, Princeton, NJ, USA.
*Corresponding author: Elisa M. Wirsching; Email: elisa.wirsching@nyu.edu

Abstract

Word embeddings are now a vital resource for social science research. However, obtaining high-quality training data for non-English languages can be difficult, and fitting embeddings therein may be computationally expensive. In addition, social scientists typically want to make statistical comparisons and do hypothesis tests on embeddings, yet this is nontrivial with current approaches. We provide three new data resources designed to ameliorate the union of these issues: (1) a new version of fastText model embeddings, (2) a multilanguage “a la carte” (ALC) embedding version of the fastText model, and (3) a multilanguage ALC embedding version of the well-known GloVe model. All three are fit to Wikipedia corpora. These materials are aimed at “low-resource” settings where the analysts lack access to large corpora in their language of interest or to the computational resources required to produce high-quality vector representations. We make these resources available for 40 languages, along with a code pipeline for another 117 languages available from Wikipedia corpora. We extensively validate the materials via reconstruction tests and other proofs-of-concept. We also conduct human crowdworker tests for our embeddings for Arabic, French, (traditional Mandarin) Chinese, Japanese, Korean, Russian, and Spanish. Finally, we offer some advice to practitioners using our resources.
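As a rough illustration of the "a la carte" idea mentioned in the abstract, the sketch below builds an embedding for a target term by averaging the pretrained vectors of the words in each of its contexts, averaging across contexts, and applying a linear transformation matrix. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names (`alc_embed`, `average_vectors`, `matvec`), the two-dimensional toy vectors, and the toy transformation matrix are all hypothetical. In practice the pretrained vectors and the transformation matrix would come from the released resources.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def average_vectors(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty sequence of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot average an empty sequence of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def matvec(matrix: Sequence[Vector], vector: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def alc_embed(
    contexts: Sequence[Sequence[str]],
    pretrained: Dict[str, Vector],
    transform: Sequence[Vector],
) -> Vector:
    """ALC-style embedding: transform the mean of per-context mean vectors.

    Words missing from `pretrained` are skipped; contexts with no known
    words are dropped.
    """
    context_means = []
    for context in contexts:
        known = [pretrained[w] for w in context if w in pretrained]
        if known:
            context_means.append(average_vectors(known))
    if not context_means:
        raise ValueError("no context contains a word with a pretrained vector")
    return matvec(transform, average_vectors(context_means))


# Hypothetical toy inputs (assumptions for illustration only).
toy_pretrained = {
    "vote": [1.0, 0.0],
    "election": [0.0, 1.0],
    "party": [1.0, 1.0],
}
toy_transform = [[2.0, 0.0], [0.0, 0.5]]
toy_contexts = [["vote", "election"], ["party", "unknown"]]

toy_embedding = alc_embed(toy_contexts, toy_pretrained, toy_transform)
print(toy_embedding)
```

Because the resulting embedding is a simple average followed by a linear map, it can be computed from only a handful of contexts, which is what makes this kind of approach attractive in low-resource settings.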

Information

Type
Letter
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of The Society for Political Methodology

Figure 1 Reconstruction performance: cosine similarity between our ALC versions of fastText and GloVe and the embeddings from those underlying architectures. Languages are ordered according to the mean accuracy for fastText. In theory, cosine similarities range between $-1$ and $1$, but empirically all estimates are positive.
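The reconstruction measure in this caption can be sketched as follows: for each word, compute the cosine similarity between its original vector and its reconstructed vector, then average over words. This is a minimal sketch with hypothetical names (`cosine_similarity`, `mean_reconstruction_similarity`) and made-up toy vectors; it is not the authors' evaluation code.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; lies in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_reconstruction_similarity(
    original: Dict[str, Vector], reconstructed: Dict[str, Vector]
) -> float:
    """Average cosine similarity over words present in both dictionaries."""
    shared = [w for w in original if w in reconstructed]
    if not shared:
        raise ValueError("no shared vocabulary to compare")
    total = sum(cosine_similarity(original[w], reconstructed[w]) for w in shared)
    return total / len(shared)


# Hypothetical toy vectors (assumptions for illustration only).
toy_original = {"a": [1.0, 0.0], "b": [0.0, 2.0]}
toy_reconstructed = {"a": [2.0, 0.0], "b": [3.0, 4.0]}

toy_score = mean_reconstruction_similarity(toy_original, toy_reconstructed)
print(round(toy_score, 3))
```

Cosine similarity ignores vector length, so a reconstruction that points in the same direction as the original scores 1 even if it is rescaled, and one that points in the opposite direction scores $-1$.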

Table 1 Nearest neighbors for English terms democracy and equality.

Table 2 Nearest neighbors for French terms nation and racisme.

Figure 2 Summary of crowdsourcing comparisons, all languages. Baseline is original fT (in figure (a)) and fT (in figure (b)).

Supplementary material: File

Wirsching et al. supplementary material (File, 12.6 MB)