
Cross-Lingual Classification of Political Texts Using Multilingual Sentence Embeddings

Published online by Cambridge University Press:  26 January 2023

Hauke Licht*
Affiliation:
Cologne Center for Comparative Politics, Institute of Political Science and European Affairs, University of Cologne, Cologne, Germany. E-mail: hauke.licht@wiso.uni-koeln.de
* Corresponding author: Hauke Licht

Abstract

Established approaches to analyzing multilingual text corpora require either a duplication of analysts’ efforts or high-quality machine translation (MT). In this paper, I argue that multilingual sentence embedding (MSE) is an attractive alternative approach to language-independent text representation. To support this argument, I evaluate MSE for cross-lingual supervised text classification. Specifically, I assess how reliably MSE-based classifiers detect manifesto sentences’ topics and positions compared to classifiers trained on bag-of-words representations of machine-translated texts, and how this depends on the amount of training data. These analyses show that when training data are relatively scarce (e.g., 20,000 or fewer labeled sentences), MSE-based classifiers can be more reliable and are at least no less reliable than their MT-based counterparts. Furthermore, I examine how reliably MSE-based classifiers label sentences written in languages not in the training data, focusing on the task of discriminating sentences that discuss the issue of immigration from those that do not. This analysis shows that, compared to the within-language classification benchmark, such “cross-lingual transfer” tends to result in smaller reliability losses when relying on the MSE instead of the MT approach. This study thus presents an important addition to the cross-lingual text analysis toolkit.
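The cross-lingual transfer setup the abstract describes can be sketched in a few lines: sentences from several languages are mapped into one shared embedding space, a classifier is fit on the labeled languages, and it is then applied to a language absent from the training data. The sketch below is a toy illustration only: the 3-dimensional vectors and the example sentences in the comments are invented stand-ins, and the nearest-centroid rule is a deliberately simple substitute for the classifiers evaluated in the paper. In practice, the embeddings would come from a multilingual sentence encoder (e.g., LASER or a multilingual Sentence-BERT model).

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_centroid(x, centroids):
    """Return the label whose class centroid is closest to x (Euclidean)."""
    return min(centroids, key=lambda lab: math.dist(x, centroids[lab]))

# Labeled training sentences (English and German), embedded in a shared space.
# Label: does the sentence discuss immigration? (hand-made toy vectors)
train = [
    ([0.9, 0.1, 0.0], "immigration"),  # en: "We must reform our asylum system."
    ([0.8, 0.2, 0.1], "immigration"),  # de: "Wir brauchen klare Einwanderungsregeln."
    ([0.1, 0.9, 0.2], "other"),        # en: "Taxes on fuel should be lowered."
    ([0.0, 0.8, 0.3], "other"),        # de: "Die Renten müssen stabil bleiben."
]

# One centroid per class, computed from the labeled languages only.
centroids = {
    label: centroid([v for v, lab in train if lab == label])
    for label in {lab for _, lab in train}
}

# A French sentence never seen in training, embedded in the same space:
# fr: "L'immigration doit être mieux encadrée."
fr_vector = [0.85, 0.15, 0.05]
print(nearest_centroid(fr_vector, centroids))  # → immigration
```

Because the encoder places semantically similar sentences near each other regardless of language, a classifier trained only on English and German examples can label the held-out French sentence; this is the "cross-lingual transfer" the paper evaluates against a within-language benchmark.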

Type
Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of the Society for Political Methodology


Footnotes

Edited by Jeff Gill

Supplementary material: Licht supplementary material (PDF, 366 KB)