Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts

  • Justin Grimmer and Brandon M. Stewart

Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.


Political Analysis
  • ISSN: 1047-1987
  • EISSN: 1476-4989
  • URL: /core/journals/political-analysis