Cross-Domain Topic Classification for Political Texts

Published online by Cambridge University Press: 21 October 2021

Moritz Osnabrügge*
Affiliation:
School of Government and International Affairs, Durham University, Durham, United Kingdom. E-mail: moritz.osnabruegge@durham.ac.uk
Elliott Ash
Affiliation:
Center for Law & Economics, ETH Zurich, Zurich, Switzerland. E-mail: ashe@ethz.ch
Massimo Morelli
Affiliation:
Department of Social and Political Sciences, Bocconi University, Milan, Italy. E-mail: massimo.morelli@unibocconi.it
*Corresponding author: Moritz Osnabrügge

Abstract

We introduce and assess the use of supervised learning in cross-domain topic classification. In this approach, an algorithm learns to classify topics in a labeled source corpus and then extrapolates topics in an unlabeled target corpus from another domain. The ability to use existing training data makes this method significantly more efficient than within-domain supervised learning. It also has three advantages over unsupervised topic models: the method can be more specifically targeted to a research question, and the resulting topics are easier both to validate and to interpret. We demonstrate the method using the case of labeled party platforms (source corpus) and unlabeled parliamentary speeches (target corpus). In addition to the standard within-domain error metrics, we further validate the cross-domain performance by labeling a subset of target-corpus documents. We find that the classifier accurately assigns topics in the parliamentary speeches, although accuracy varies substantially by topic. We also propose tools for diagnosing cross-domain classification. To illustrate the usefulness of the method, we present two case studies on how electoral rules and the gender of parliamentarians influence the choice of speech topics.
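
The workflow described in the abstract (train on a labeled source corpus, predict topics in a target corpus from another domain, then check the predictions against a hand-labeled subset of the target corpus) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the multinomial naive Bayes classifier, the whitespace tokenizer, the toy documents, the two topic names, and all function names are hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_naive_bayes(docs, labels):
    """Fit a multinomial naive Bayes model on the labeled source corpus."""
    word_counts = defaultdict(Counter)  # topic -> word -> count
    doc_counts = Counter(labels)        # topic -> number of documents
    vocab = set()
    for text, topic in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[topic].update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the topic with the highest posterior log-probability."""
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    # Words never seen in the source corpus carry no information here.
    tokens = [t for t in tokenize(text) if t in vocab]
    best_topic, best_score = None, -math.inf
    for topic in sorted(doc_counts):
        total = sum(word_counts[topic].values())
        score = math.log(doc_counts[topic] / n_docs)
        for t in tokens:
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            score += math.log((word_counts[topic][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


def accuracy_by_topic(true_labels, predicted):
    """Overall accuracy plus per-topic accuracy on a hand-labeled subset."""
    hits, totals = Counter(), Counter()
    for y, y_hat in zip(true_labels, predicted):
        totals[y] += 1
        hits[y] += int(y == y_hat)
    overall = sum(hits.values()) / len(true_labels)
    return overall, {topic: hits[topic] / totals[topic] for topic in totals}


# Toy source corpus (hypothetical stand-in for labeled party platforms).
source_docs = [
    "we will cut taxes and reduce the budget deficit",
    "lower taxes support economic growth and jobs",
    "protect rivers forests and clean air",
    "reduce carbon emissions and protect the climate",
]
source_labels = ["economy", "economy", "environment", "environment"]

# Toy target corpus (hypothetical stand-in for parliamentary speeches),
# with hand labels playing the role of the validation subset.
target_docs = [
    "the minister must explain why taxes keep rising",
    "this bill does nothing to protect our forests",
]
target_labels = ["economy", "environment"]

model = train_naive_bayes(source_docs, source_labels)
predictions = [predict(model, doc) for doc in target_docs]
overall, per_topic = accuracy_by_topic(target_labels, predictions)
print(predictions, overall, per_topic)
```

Reporting accuracy per topic as well as overall reflects the abstract's observation that cross-domain accuracy varies substantially by topic; a single aggregate number would hide that variation.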

Information

Type
Article
Creative Commons
Creative Commons Licence: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s) 2021. Published by Cambridge University Press on behalf of the Society for Political Methodology

Table 1. Summary of design factors for topic classification methods.

Table 2. Overview of classifier performance in test set.

Table 3. Classifier performance with eight topics: confusion matrices.

Table 4. Classifier performance with 44 topics.

Figure 1. N-gram correlations with topics for source and target corpus. Notes: Scatter plot for the eight topics, showing the t-statistics of N-grams in the manifesto corpus (vertical axis) against the t-statistics in the speech data (horizontal axis).

Figure 2. Feature congruence and cross-domain classification accuracy. Notes: Scatter plot for the 44 topics, showing each topic’s top-3 cross-domain classification accuracy (vertical axis) against the feature congruence, as defined in the text.

Figure 3. Effect of electoral reform on political authority. Notes: Vertical dashed lines indicate the year that the reform passed (1993) and went into effect (1996). The horizontal dashed line indicates the outcome mean in 1995. The bars illustrate 95% confidence intervals.

Supplementary material

Osnabrügge et al. Dataset (Link)

Osnabrügge et al. supplementary material (PDF, 11.9 MB)