Validating open-source machine translation for quantitative text analysis

Published online by Cambridge University Press:  29 April 2026

Hauke Licht*
Affiliation:
Department of Political Science and Digital Science Center, University of Innsbruck, Innsbruck, Austria
Ronja Ida Sczepanski
Affiliation:
Centre for European Studies and Comparative Politics, Sciences Po, Paris, France
Moritz Laurer
Affiliation:
Cohere Inc., Toronto, ON, Canada; Communication Science, Vrije Universiteit Amsterdam, Amsterdam, Netherlands
Ayjeren Bekmuratovna
Affiliation:
Independent researchers
*Corresponding author: Hauke Licht; Email: hauke.licht@uibk.ac.at

Abstract

Machine translation (MT) is an essential tool in many multilingual computational text analysis applications. However, relying on commercial services like Google Translate or DeepL limits reproducibility and can be expensive. This paper assesses the viability of a reproducible, transparent, and affordable alternative: open-source MT models. We ask whether using open-source MT models instead of commercial services substantially changes the measurements obtained from multilingual corpora by extending an influential study by de Vries et al. and contributing an original study focusing on Transformer-based supervised text classification. Our findings reveal negligible differences in results between the two MT approaches, suggesting that open-source MT models are highly valuable tools for multilingual text analysis.

Information

Type
Original Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of EPS Academic Ltd.

Table 1. Comparison of costs in Euro and compute time arising when translating a fixed amount of text with commercial services or the M2M open-source MT model

Table 2. Summary statistics of cosine similarities between bag-of-words representations obtained from machine- and human-translated texts at the document level. Columns grouped by translation model
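The similarity measure named in the Table 2 caption can be illustrated with a minimal sketch. It assumes lowercased, whitespace-tokenized word counts as the bag-of-words representation; the preprocessing used in the article may differ, and the function names and example sentences below are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words counts; lowercasing and whitespace tokenization are assumptions."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


# Hypothetical document pair: a human translation and a machine translation.
human = "the parliament adopted the budget"
machine = "the parliament approved the budget"
sim = cosine_similarity(bow(human), bow(machine))
print(round(sim, 3))
```

A value near 1 indicates that the two translations of a document use nearly the same words with nearly the same frequencies; summary statistics such as those in Table 2 would then be computed over these document-level values.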

Table 3. Summary statistics of correlations between document-level topic proportion estimates obtained from machine- and human-translated texts. Columns grouped by translation model

Figure 1. Similarity of corpus-level topical prevalence between the human-translated corpus and the corpora translated with Google Translate and OPUS-MT, respectively. Topical prevalence means the probability of a document being assigned to a specific topic.

Table 4. Datasets

Table 5. Tasks overview

Figure 2. Distribution of similarities of open-source MT models’ translations to DeepL translations by language and open-source MT model in a sample of 500 sentences per language drawn from the CMP Translations corpus. Translation similarity is measured with BERTScore at the translation-pair level. Note that no OPUS-MT translation to English was obtained for Greek, Lithuanian, Norwegian, Portuguese, Romanian, and Slovenian due to translation direction limitations.

Table 6. OLS coefficient estimates of the effect of using open-source vs. commercial machine translation models for translating input texts on classifiers’ language-specific out-of-sample classification performance (F1 score)

Figure 3. Predicted language-specific F1 scores by language and type of MT model. Estimates based on regression reported in Table C18.

Supplementary material: File

Licht et al. supplementary material

File 17 MB
Supplementary material: Link

Licht et al. Dataset

Link