Automated Text Classification of News Articles: A Practical Guide

Published online by Cambridge University Press: 09 June 2020

Pablo Barberá*
Affiliation:
Associate Professor of Political Science and International Relations, University of Southern California, Los Angeles, CA 90089, USA. Email: pbarbera@usc.edu
Amber E. Boydstun
Affiliation:
Associate Professor of Political Science, University of California, Davis, CA 95616, USA. Email: aboydstun@ucdavis.edu
Suzanna Linn
Affiliation:
Liberal Arts Professor of Political Science, Department of Political Science, Penn State University, University Park, PA 16802, USA. Email: sld8@psu.edu
Ryan McMahon
Affiliation:
PhD Graduate, Department of Political Science, Penn State University, University Park, PA 16802, USA (now at Google). Email: mcmahon.rb@gmail.com
Jonathan Nagler
Affiliation:
Professor of Politics and co-Director of the Center for Social Media and Politics, New York University, New York, NY 10012, USA. Email: jonathan.nagler@nyu.edu

Abstract

Automated text analysis methods have made possible the classification of large corpora of text by measures such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions they need to make before any measure can be produced from the text. We consider, both theoretically and empirically, the effects of such choices using as a running example efforts to measure the tone of New York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield radically different corpora and we advocate for the use of keyword searches rather than predefined subject categories provided by news archives. We demonstrate the benefits of coding using article segments instead of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the number of unique documents coded rather than the number of coders for each document. Finally, we find that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without attending to the methodological choices therein.

Information

Type: Articles
Copyright: © The Author(s) 2020. Published by Cambridge University Press on behalf of the Society for Political Methodology.

Supplementary material

Barberá et al. Dataset (link)
Barberá et al. supplementary material (file, 282.6 KB)