No Longer Lost in Translation: Evidence that Google Translate Works for Comparative Bag-of-Words Text Applications

  • Erik de Vries, Martijn Schoonvelde and Gijs Schumacher
Abstract

Automated text analysis allows researchers to analyze large quantities of text. Yet, comparative researchers are presented with a big challenge: across countries people speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al. 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models—such as topic models. We use the europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold standard translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find TDMs for both text corpora to be highly similar, with minor differences across languages. What is more, we find considerable overlap in the set of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar with again only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers when using bag-of-words text models.
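The document-level and corpus-level comparisons described above can be illustrated with a minimal Python sketch. It is not the authors' replication code: cosine similarity and Jaccard overlap are assumed here as plausible similarity measures, the helper names (`tokenize`, `term_counts`, `cosine_similarity`, `feature_overlap`) are hypothetical, and the two toy corpora are invented stand-ins for human-translated and machine-translated versions of the same documents.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(text):
    """Bag-of-words representation: token -> frequency, word order discarded."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_overlap(docs_a, docs_b):
    """Jaccard overlap between the vocabularies (feature sets) of two corpora."""
    vocab_a = set().union(*(term_counts(d).keys() for d in docs_a))
    vocab_b = set().union(*(term_counts(d).keys() for d in docs_b))
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return len(vocab_a & vocab_b) / len(union)


# Invented toy data: the same two documents in a "human" and a "machine" translation.
human = [
    "The committee adopted the budget proposal",
    "Members debated the fisheries policy",
]
machine = [
    "The committee approved the budget proposal",
    "Members debated the fisheries policy",
]

# Document level: compare each row of one term-document matrix with its counterpart.
similarities = [
    cosine_similarity(term_counts(h), term_counts(m))
    for h, m in zip(human, machine)
]

# Corpus level: share of features that appear in both translations.
overlap = feature_overlap(human, machine)

print(similarities, overlap)
```

In this sketch a single differing word choice ("adopted" versus "approved") lowers the similarity of the first document pair while the identical second pair scores close to 1, which is the kind of document-level difference a bag-of-words comparison is sensitive to.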

Footnotes

Authors’ note: Replication code and data are available at the Political Analysis Dataverse (De Vries, Schoonvelde, and Schumacher 2018) while the supplementary materials for this article are available on the Political Analysis web site. The authors would like to thank James Cross, Aki Matsuo, Christian Rauh, Damian Trilling, Mariken van der Velden and Barbara Vis for helpful comments and suggestions. GS and MS acknowledge funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 649281, EUENGAGE. EdV acknowledges funding for a research assistantship from the Access Europe (since 2018: UVAccess Europe) research center at the University of Amsterdam.

Contributing Editor: Jonathan N. Katz

References
Agarwal, Apoorv, Xie, Boyi, Vovsha, Ilia, Rambow, Owen, and Passonneau, Rebecca. 2011. Sentiment analysis of twitter data. In Proceedings of the workshop on languages in social media. Association for Computational Linguistics, pp. 30–38.
Aharoni, Roee. 2015. Automatic detection of machine translated text and translation quality estimation. PhD thesis, Department of Computer Science, Bar-Ilan University Ramat Gan, Israel.
Balahur, Alexandra, and Turchi, Marco. 2014. Comparative experiments using supervised learning and machine translation for multilingual sentiment analysis. Computer Speech and Language 28(1):56–75.
Benoit, Kenneth, and Nulty, Paul. 2013. Quanteda: quantitative analysis of textual data. http://quanteda.io. An R library for Managing and Analyzing Text.
Bird, S., Klein, E., and Loper, E. 2009. Natural language processing with Python. O’Reilly Media, Inc.
Blei, David M., Ng, Andrew Y., and Jordan, Michael I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(1):993–1022.
Chuang, Jason, Gupta, Sonal, Manning, Christopher, and Heer, Jeffrey. 2013. Topic model diagnostics: assessing domain relevance via topical alignment. In Proceedings of the 30th international conference on machine learning (ICML-13), pp. 612–620.
Courtney, Michael, Breen, Michael, McMenamin, Iain, and McNulty, Gemma. 2017. Deductive validation of machine translation for text analysis in comparative politics. Working Paper, Dublin City University.
De Vries, Erik, Schoonvelde, Martijn, and Schumacher, Gijs. 2018. Replication data for: no longer lost in translation: evidence that Google Translate works for comparative bag-of-words text applications, https://doi.org/10.7910/DVN/VKMY6N, Harvard Dataverse, V1.
Denny, Matthew James, and Spirling, Arthur. 2018. Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it. Political Analysis 26(2):168–189.
Gray, Russell D., and Atkinson, Quentin D. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426:435–439.
Greene, Zac, Ceron, Andrea, Schumacher, Gijs, and Fazekas, Zoltán. 2016. The nuts and bolts of automated text analysis. Comparing different document pre-processing techniques in four countries. Open Science Framework.
Grimmer, Justin, and Stewart, Brandon M. 2013. Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Political Analysis 21(3):267–297.
Hampshire, Stephen, and Salvia, Carmen Porta. 2010. Translation and the internet: evaluating the quality of free online machine translators. Quaderns: revista de traducció 17:197–209.
Hornik, Kurt, and Grün, Bettina. 2011. Topicmodels: an R package for fitting topic models. Journal of Statistical Software 40(13):1–30.
Kaljahi, Zadeh, and Samad, Rasoul. 2015. The role of syntax and semantics in machine translation and quality estimation of machine-translated user-generated content. PhD thesis, Dublin City University. Available at http://doras.dcu.ie/20499/.
Koehn, Philipp. 2005. Europarl: a parallel corpus for statistical machine translation. In MT summit, vol. 5, pp. 79–86.
Koehn, Philipp, and Monz, Christof. 2006. Manual and automatic evaluation of machine translation between European languages. In Proceedings of the workshop on statistical machine translation. Association for Computational Linguistics, pp. 102–121.
Loaiciga, Sharid, Meyer, Thomas, and Popescu-Belis, Andrei. 2014. English–French verb phrase alignment in europarl for tense translation modeling. In LREC, pp. 674–681.
Lotz, Susan, and Van Rensburg, Alta. 2014. Translation technology explored: has a three-year maturation period done Google Translate any good? Stellenbosch Papers in Linguistics Plus 43:235–259.
Lucas, Christopher, Nielsen, Richard A., Roberts, Margaret E., Stewart, Brandon M., Storer, Alex, and Tingley, Dustin. 2015. Computer-assisted text analysis for comparative politics. Political Analysis 23(2):254–277.
Popescu-Belis, Andrei, Meyer, Thomas, Liyanapathirana, Jeevanthi, Cartoni, Bruno, and Zufferey, Sandrine. 2012. Discourse-level annotation over europarl for machine translation: connectives and pronouns. In Proceedings of the eighth international conference on language resources and evaluation (LREC). Number EPFL-CONF-192582.
Roberts, Margaret E., Stewart, Brandon M., and Airoldi, Edoardo M. 2016. A model of text for experimentation in the social sciences. Journal of the American Statistical Association 111(515):988–1003.
Scarton, Carolina, and Specia, Lucia. 2014. Document-level translation quality estimation: exploring discourse and pseudo-references. In The 17th annual conference of the European association for machine translation, pp. 101–108.
Schwarz, Daniel, Traber, Denise, and Benoit, Kenneth. 2017. Estimating intra-party preferences: comparing speeches to votes. Political Science Research and Methods 5(2):379–396.
Political Analysis
  • ISSN: 1047-1987
  • EISSN: 1476-4989
Supplementary materials

de Vries et al. supplementary material
Online Appendix
