
Computer-Assisted Text Analysis for Comparative Politics

  • Christopher Lucas, Richard A. Nielsen, Margaret E. Roberts, Brandon M. Stewart, Alex Storer and Dustin Tingley

Recent advances in research tools for the systematic analysis of textual data are enabling exciting new research throughout the social sciences. For comparative politics scholars, who are often interested in non-English and possibly multilingual textual datasets, these advances may be difficult to access. This article discusses practical issues that arise in the processing, management, translation, and analysis of textual data, with a particular focus on how procedures differ across languages. These procedures are combined in two applied examples of automated text analysis using the recently introduced Structural Topic Model. We also show how the model can be used to analyze data that have been translated into a single language via machine translation tools. All the methods we describe here are implemented in open-source software packages available from the authors.


Authors' note: Our thanks to Sam Brotherton and Jetson Leder-Luis for research assistance and Amy Catalinac for discussion about text analyses in comparative politics. We also thank Christopher Blattman, Dan Corstange, Macartan Humphreys, Amaney Jamal, Gary King, Helen Milner, Tamar Mitts, Brendan O’Connor, Arthur Spirling, and the Columbia University Comparative Politics Workshop for comments. Our software discussed in this article is open source and available.


Political Analysis
  • ISSN: 1047-1987
  • EISSN: 1476-4989
  • URL: /core/journals/political-analysis
Supplementary materials

Lucas et al. supplementary material

 PDF (269 KB)

