Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts

  • Justin Grimmer and Brandon M. Stewart
Abstract

Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.

Corresponding author
e-mail: jgrimmer@stanford.edu (corresponding author)
Footnotes

Authors' note: For helpful comments and discussions, we thank participants in Stanford University's Text as Data class, Mike Alvarez, Dan Hopkins, Gary King, Kevin Quinn, Molly Roberts, Mike Tomz, Hanna Wallach, Yuri Zhukov, and Frances Zlotnick. Replication data are available on the Political Analysis Dataverse at http://hdl.handle.net/1902.1/18517. Supplementary materials for this article are available on the Political Analysis Web site.


Political Analysis (2013) 21:350–367


Political Analysis
  • ISSN: 1047-1987
  • EISSN: 1476-4989
  • URL: /core/journals/political-analysis