The Multiclass Classification of Newspaper Articles with Machine Learning: The Hybrid Binary Snowball Approach

Published online by Cambridge University Press: 11 November 2020

Miklós Sebők*
Affiliation:
Centre for Social Sciences, Hungarian Academy of Sciences, Budapest, Hungary. Email: sebok.miklos@tk.mta.hu
Zoltán Kacsuk
Affiliation:
Centre for Social Sciences, Hungarian Academy of Sciences, Budapest, Hungary; Hochschule der Medien, Stuttgart, Germany
*Corresponding author: Miklós Sebők

Abstract

In this article, we present a machine learning-based solution for matching the performance of the gold standard of double-blind human coding when it comes to content analysis in comparative politics. We combine a quantitative text analysis approach with supervised learning and limited human resources in order to classify the front-page articles of a leading Hungarian daily newspaper based on their full text. Our goal was to assign items in our dataset to one of 21 policy topics based on the codebook of the Comparative Agendas Project. The classification of the imbalanced classes of topics was handled by a hybrid binary snowball workflow. This relies on limited human resources as well as supervised learning; it simplifies the multiclass problem to one of binary choice; and it is based on a snowball approach as we augment the training set with machine-classified observations after each successful round and also between corpora. Our results show that our approach provided better precision results (of over 80% for most topic codes) than what is customary for human coders and most computer-assisted coding projects. Nevertheless, this high precision came at the expense of a relatively low, below 60%, share of labeled articles.
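
The workflow summarized in the abstract (one binary decision per topic, with machine-labeled articles fed back into the training set after each round) can be illustrated with a minimal sketch. This is not the authors' implementation: the word-frequency scorer, the acceptance rule (exactly one binary classifier votes yes), the `threshold` value, the toy articles, and all function names are assumptions made for illustration, and the human validation and ensemble voting steps are left out.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_binary(docs, labels, topic):
    # Hypothetical stand-in for a supervised binary learner: each word's weight is
    # its relative frequency in on-topic articles minus that in all other articles.
    pos, neg = Counter(), Counter()
    for doc, label in zip(docs, labels):
        (pos if label == topic else neg).update(tokenize(doc))
    n_pos = sum(pos.values()) or 1
    n_neg = sum(neg.values()) or 1
    return {w: pos[w] / n_pos - neg[w] / n_neg for w in set(pos) | set(neg)}

def score(model, doc):
    tokens = tokenize(doc)
    return sum(model.get(t, 0.0) for t in tokens) / max(len(tokens), 1)

def snowball(train_docs, train_labels, unlabeled, topics, threshold=0.05, max_rounds=5):
    docs, labels = list(train_docs), list(train_labels)
    pending = list(range(len(unlabeled)))  # indices of articles still without a label
    assigned = {}
    for _ in range(max_rounds):
        # One binary (topic vs. rest) model per topic, retrained every round.
        models = {t: train_binary(docs, labels, t) for t in topics}
        accepted = {}
        for i in pending:
            hits = [t for t in topics if score(models[t], unlabeled[i]) >= threshold]
            if len(hits) == 1:  # assumed rule: accept only an unambiguous positive
                accepted[i] = hits[0]
        if not accepted:
            break  # no new labels, so another round would change nothing
        for i, topic in accepted.items():
            assigned[i] = topic
            docs.append(unlabeled[i])  # snowball step: grow the training set
            labels.append(topic)
        pending = [i for i in pending if i not in accepted]
    return assigned, pending

def precision_and_coverage(assigned, gold):
    # Precision over the labeled articles only; coverage is the share labeled.
    correct = sum(1 for i, t in assigned.items() if gold[i] == t)
    precision = correct / len(assigned) if assigned else 0.0
    return precision, len(assigned) / len(gold)

# Toy data, invented for this example.
train_docs = ["budget tax deficit", "hospital doctor patient"]
train_labels = ["macroeconomics", "health"]
unlabeled = ["tax budget reform", "doctor hospital strike",
             "pension reform", "football match result"]
gold = ["macroeconomics", "health", "macroeconomics", "sports"]

assigned, pending = snowball(train_docs, train_labels, unlabeled,
                             ["macroeconomics", "health"])
precision, coverage = precision_and_coverage(assigned, gold)
```

In this toy run, "pension reform" cannot be labeled in the first round because neither word appears in the seed set. It is picked up in the second round, once "reform" has entered the training data through a machine-labeled article. The article that fits neither topic stays unlabeled, which mirrors the trade-off the abstract reports between high precision and a limited share of labeled articles.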

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2020. Published by Cambridge University Press on behalf of the Society for Political Methodology
Table 1 Elements of the HBS workflow solution.

Figure 1 HBS in action: A coding process for a virgin corpus with in-process ensemble voting.

Figure 2 The increase of the coded articles for MN by coding rounds.

Figure 3 Precision of MN corpus coding by CAP major topic.

Figure 4 Precision of MN corpus coding by CAP major topic.

Figure 5 Representative words of the boundary topics of environment and energy.