
Automated Text Classification of News Articles: A Practical Guide

Published online by Cambridge University Press:  09 June 2020

Pablo Barberá*
Associate Professor of Political Science and International Relations, University of Southern California, Los Angeles, CA 90089, USA.
Amber E. Boydstun
Associate Professor of Political Science, University of California, Davis, CA 95616, USA.
Suzanna Linn
Liberal Arts Professor of Political Science, Department of Political Science, Penn State University, University Park, PA 16802, USA.
Ryan McMahon
PhD Graduate, Department of Political Science, Penn State University, University Park, PA 16802, USA (now at Google).
Jonathan Nagler
Professor of Politics and co-Director of the Center for Social Media and Politics, New York University, New York, NY 10012, USA.


Automated text analysis methods have made possible the classification of large corpora of text by measures such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions they need to make before any measure can be produced from the text. We consider, both theoretically and empirically, the effects of such choices using as a running example efforts to measure the tone of New York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield radically different corpora and we advocate for the use of keyword searches rather than predefined subject categories provided by news archives. We demonstrate the benefits of coding using article segments instead of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the number of unique documents coded rather than the number of coders for each document. Finally, we find that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without attending to the methodological choices therein.
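
As a rough illustration of two steps named in the abstract, the sketch below selects a corpus by keyword search and then scores the tone of each selected segment with a dictionary. It is a minimal toy under stated assumptions, not the authors' procedure: the keyword list, the positive and negative word lists, the example segments, and the function names `matches_keywords` and `dictionary_tone` are all hypothetical and do not come from the article. The abstract reports that supervised learning outperformed dictionaries on a number of criteria, so a scorer of this kind is better read as a baseline to validate against human coding than as a recommended measure.

```python
import re

# Hypothetical word lists, invented for illustration only.
# They are NOT the search terms or dictionaries used in the article.
ECONOMY_KEYWORDS = {"economy", "unemployment", "inflation", "recession"}
POSITIVE_WORDS = {"growth", "gain", "recovery", "strong"}
NEGATIVE_WORDS = {"recession", "loss", "decline", "weak"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def matches_keywords(text, keywords=ECONOMY_KEYWORDS):
    """Corpus selection: keep a segment if it contains any search keyword."""
    return any(tok in keywords for tok in tokenize(text))


def dictionary_tone(text):
    """Tone in [-1, 1]: (positive - negative) / (positive + negative).

    Returns 0.0 when no dictionary word appears in the text.
    """
    tokens = tokenize(text)
    pos = sum(tok in POSITIVE_WORDS for tok in tokens)
    neg = sum(tok in NEGATIVE_WORDS for tok in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


# Made-up article segments (the unit of analysis the abstract favors
# over single sentences).
segments = [
    "The economy showed strong growth this quarter.",
    "Fears of a recession and job loss weigh on the economy.",
    "The team won the championship game.",
]

# Step 1: keyword-based corpus selection drops the sports segment.
selected = [s for s in segments if matches_keywords(s)]

# Step 2: dictionary tone score for each selected segment.
tones = [dictionary_tone(s) for s in selected]
```

In practice the choice of keywords determines which segments ever reach the scoring step, which is why the abstract treats corpus selection as a consequential decision in its own right.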

Copyright © The Author(s) 2020. Published by Cambridge University Press on behalf of the Society for Political Methodology.



Contributing Editor: Jeff Gill


Barberá, P., Boydstun, A., Linn, S., McMahon, R., and Nagler, J. 2020. “Replication Data for: Automated Text Classification of News Articles: A Practical Guide.” Harvard Dataverse, V1. doi:10.7910/DVN/MXKRDE, UNF:6:AR3Usj7mJKo7lkT/YUsaXA== [fileUNF].
Supplementary material: Link

Barberá et al. Dataset

Supplementary material: File

Barberá et al. supplementary material
