No Longer Lost in Translation: Evidence that Google Translate Works for Comparative Bag-of-Words Text Applications

Published online by Cambridge University Press: 11 September 2018

Erik de Vries
Affiliation:
Department of Media and Social Sciences, University of Stavanger, Stavanger, Norway. Email: erik.devries@uis.no
Martijn Schoonvelde
Affiliation:
Department of Political Science and Public Administration, Vrije Universiteit, Amsterdam, The Netherlands. Email: h.j.m.schoonvelde@vu.nl
Gijs Schumacher*
Affiliation:
Department of Political Science, University of Amsterdam, Amsterdam, The Netherlands. Email: g.schumacher@uva.nl

Abstract

Automated text analysis allows researchers to analyze large quantities of text. Yet, comparative researchers are presented with a big challenge: across countries people speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al. 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models—such as topic models. We use the europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold standard translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find TDMs for both text corpora to be highly similar, with minor differences across languages. What is more, we find considerable overlap in the set of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar with again only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers when using bag-of-words text models.
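The document-level TDM comparison described above can be illustrated with a minimal sketch. It assumes cosine similarity between the term-count vectors of two English versions of the same document (the measure named in the figure captions). The tokenizer, the example sentences, and the function names `term_counts` and `cosine_similarity` are hypothetical and are not taken from the authors' replication code, which also involves preprocessing steps not shown here.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Crude bag of words: lowercase, keep alphabetic tokens, count them.

    Hypothetical helper; real pipelines would typically also stem and
    remove stopwords before building the term-document matrix.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


# Invented example sentences standing in for one document's two translations.
gold = term_counts("The committee adopted the report on fisheries policy")
machine = term_counts("The committee approved the report on fishing policy")

similarity = cosine_similarity(gold, machine)
shared = set(gold) & set(machine)  # features present in both versions

print(f"cosine similarity: {similarity:.3f}")
print(f"shared features: {sorted(shared)}")
```

In this toy setup, each document pair yields one similarity score, and the set intersection gives the overlapping features; aggregating such scores per language would correspond to the kind of distribution the captions describe.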

Information

Type
Articles
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s) 2018. Published by Cambridge University Press on behalf of the Society for Political Methodology.

Figure 1. Research design. Note: This figure shows the different steps of our research design. In both cases we start with non-English texts, which have been translated into English, either through Google Translate or through EU-employed expert translators (Step 1). The English translations are then preprocessed and turned into TDMs (Step 2), on which we then estimate a topic model (Step 3). We then compare our four different outcome variables (Step 4). The comparisons are the following: Comparison 1: document-to-document comparison of TDM similarity; Comparison 2: document-to-document comparison of topic distributions (topical prevalence); Comparison 3: topic-to-topic comparison of stem weights (topical content); Comparison 4: topic-to-topic comparison of topic distribution (topical prevalence).

Figure 2. Model harmonic mean.

Table 1. Comparisons between gold standard and machine-translated data.

Figure 3. Distribution of cosine similarity per language pair.

Table 2. Cosine similarity distribution per language.

Figure 4. Unique TDM features for gold standard and machine-translated corpora. Reading example: for French, the number of overlapping features is around 28,000, while the total number of features is around 33,000 for the machine-translated documents and around 38,000 for the gold standard documents.

Figure 5. Similarity of document-level topical prevalence with equal number of topics.

Table 3. Similarity of document-level topical prevalence with equal number of topics.

Figure 6. Similarity of corpus-level topical prevalence with equal number of topics. Overall descriptives: $N=446$, $M=0.699$, $SD=0.321$.

Figure 7. Similarity of topical content with equal number of topics. Overall descriptives: $N=446$, $M=0.708$, $SD=0.345$.

Supplementary material: File

de Vries et al. supplementary material

Online Appendix