‘Super-Unsupervised’ Classification for Labelling Text: Online Political Hostility as an Illustration

Published online by Cambridge University Press:  24 April 2023

Stig Hebbelstrup Rye Rasmussen*
Affiliation:
Political Science, Aarhus University, Copenhagen, Denmark
Alexander Bor
Affiliation:
Political Science, Aarhus University, Copenhagen, Denmark
Mathias Osmundsen
Affiliation:
Political Science, Aarhus University, Copenhagen, Denmark
Michael Bang Petersen
Affiliation:
Political Science, Aarhus University, Copenhagen, Denmark
*Corresponding author. E-mail: stighj@hotmail.com

Abstract

We live in a world of text. Yet the sheer magnitude of social media data, coupled with a need to measure complex psychological constructs, has made this important source of data difficult to use. Researchers often engage in costly hand coding of thousands of texts using supervised techniques or rely on unsupervised techniques where the measurement of predefined constructs is difficult. We propose a novel approach that we call ‘super-unsupervised’ learning and demonstrate its usefulness by measuring the psychologically complex construct of online political hostility based on a large corpus of tweets. This approach accomplishes the feat by combining the best features of supervised and unsupervised learning techniques: measurements of complex psychological constructs without a single labelled data source. We first outline the approach before conducting a diverse series of tests that include: (i) face validity, (ii) convergent and discriminant validity, (iii) criterion validity, (iv) external validity, and (v) ecological validity.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
Copyright © The Author(s), 2023. Published by Cambridge University Press

Figure 1. Illustration of the super-unsupervised approach. The colours refer to how the proposed approach differs from traditional approaches using word embeddings. The ‘green’ boxes are very typical when using word embeddings, whereas the blue boxes reflect the novel contribution of the super-unsupervised approach.

Table 1. Overview of similarities and differences between the dictionary-based, topic-modelling-based and super-unsupervised approaches to measuring and defining political hostility

Table 2. Full text for example tweets in Fig. 2

Figure 2. Sample locations for tweets in terms of their distances from the ‘political’ and ‘hate’ word vectors and their Toxicity score. The sample texts are 10 randomly selected tweets from among the first 40,000 tweets scoring closest to ‘political hate’ and 10 randomly selected tweets from among the 40,000 scoring most highly on toxicity. The tweets are sorted separately for toxicity and political hate. The rest of the example tweets can be found in Appendix 1.2.2.

Table 3. Cosine similarity distances for the word vectors ‘political,’ ‘hate’ and ‘political hate.’ For each word vector, the 20 words closest to the given word are extracted
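The nearest-word lookup described in the Table 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional embeddings are invented toy values, the helper names (`cosine`, `add`, `nearest`) are hypothetical, and building the ‘political hate’ vector by adding the ‘political’ and ‘hate’ vectors is an assumption made here for the sake of the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def nearest(target, embeddings, k, exclude=()):
    """Return the k words whose vectors are most similar to `target`."""
    scored = [
        (word, cosine(target, vec))
        for word, vec in embeddings.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (made-up values, for illustration only).
embeddings = {
    "political": [1.0, 0.0, 0.0],
    "hate": [0.0, 1.0, 0.0],
    "election": [0.9, 0.1, 0.0],
    "despise": [0.1, 0.9, 0.0],
    "traitor": [0.7, 0.7, 0.1],
    "sunshine": [0.0, 0.0, 1.0],
}

# Assumption: the combined concept is the sum of the two word vectors.
political_hate = add(embeddings["political"], embeddings["hate"])

# The caption extracts the 20 closest words; k=3 suits the toy vocabulary.
top = nearest(political_hate, embeddings, k=3, exclude={"political", "hate"})
for word, score in top:
    print(f"{word}: {score:.3f}")
```

With real embeddings trained on the tweet corpus, the same lookup with `k=20` would produce word lists of the kind reported in the table.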

Table 4. Correlations between measures based on tweets (political hate, political, hate, toxicity and sentiment) and measures from the survey data (political interest, political knowledge, self-reported hostility and gender)

Figure 3. The figure illustrates the distance from each word to Political Hate for the Democratic and the Republican classifiers of Political Hate. We have included the top 70 words for each group and excluded those that were common to both.

Figure 4. The figure illustrates the development of political hate on Twitter in the period leading up to general elections in Sweden, Denmark, Germany, and Italy in 2014, 2015, 2017, and 2018, respectively.

Supplementary material: File

Hebbelstrup Rye Rasmussen et al. supplementary material
File 3.1 MB
Supplementary material: File

Hebbelstrup_Rye_Rasmussen_et_al._Dataset

Dataset
