
Goodbye human annotators? Content analysis of social policy debates using ChatGPT

Published online by Cambridge University Press:  03 January 2025

Erwin Gielens*
Affiliation:
Department of Sociology, Tilburg University, Tilburg, the Netherlands
Jakub Sowula
Affiliation:
University of Tübingen, Germany and Bern University of Teacher Education, Bern, Switzerland
Philip Leifeld
Affiliation:
Department of Social Statistics, University of Manchester, Manchester, UK
Corresponding author: Erwin Gielens; Email: e.e.c.gielens@tilburguniversity.edu

Abstract

Content analysis is a valuable tool for analysing policy discourse, but annotation by humans is costly and time-consuming. ChatGPT is a potentially valuable tool to partially automate content analysis for policy debates, largely replacing human annotators. We evaluate ChatGPT’s ability to classify documents using pre-defined argument descriptions, comparing its performance with that of human annotators for two policy debates: the Universal Basic Income debate on Dutch Twitter (2014–2016) and the pension reform debate in German newspapers (1993–2001). We use both the API (GPT-4 Turbo) and the user interface version (GPT-4) and evaluate multiple performance metrics (accuracy, precision, and recall). ChatGPT is highly reliable and accurate in classifying pre-defined arguments across datasets. However, precision and recall are much lower and vary strongly between arguments. These results hold for both datasets, despite differences in language and media type. Moreover, the cut-off method proposed in this paper may aid researchers in navigating the trade-off between detection and noise. Overall, we do not (yet) recommend a blind application of ChatGPT to classify arguments in policy debates. Those interested in adopting this tool should manually validate ChatGPT’s classifications before using them in further analyses. At least for now, human annotators are here to stay.
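The evaluation metrics named in the abstract (accuracy, precision, and recall) are all derived from a per-argument confusion matrix comparing model labels against human annotations. The following is a minimal illustrative sketch (not the authors' code), assuming binary per-document labels where 1 means the argument is present:

```python
# Illustrative sketch: computing accuracy, precision, and recall for one
# argument from binary human vs. model labels. The example data below is
# hypothetical, not from the paper's datasets.
from typing import List, Tuple


def confusion_counts(human: List[int], model: List[int]) -> Tuple[int, int, int, int]:
    """Count true positives, false positives, false negatives, true negatives."""
    tp = sum(1 for h, m in zip(human, model) if h == 1 and m == 1)
    fp = sum(1 for h, m in zip(human, model) if h == 0 and m == 1)
    fn = sum(1 for h, m in zip(human, model) if h == 1 and m == 0)
    tn = sum(1 for h, m in zip(human, model) if h == 0 and m == 0)
    return tp, fp, fn, tn


def metrics(human: List[int], model: List[int]) -> Tuple[float, float, float]:
    """Return (accuracy, precision, recall); precision/recall are 0 if undefined."""
    tp, fp, fn, tn = confusion_counts(human, model)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical annotations for six documents (1 = argument present)
human = [1, 1, 0, 0, 1, 0]
model = [1, 0, 0, 1, 1, 0]
acc, prec, rec = metrics(human, model)  # acc = 4/6, prec = 2/3, rec = 2/3
```

High accuracy with low precision and recall, as the abstract reports, is typical when most documents do not contain a given argument: the many true negatives inflate accuracy even when the model misses or over-predicts the rarer positive cases.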

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (https://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is included and the original work is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Table 1. Top ten arguments from the UBI tweets (descriptions translated from Dutch) and the number of sampled tweets (N) containing these arguments


Table 2. Top ten arguments from the pension reform newspaper articles (descriptions translated from German) and the number of sampled newspaper articles (N) containing these arguments


Figure 1. Conceptual diagram of a confusion matrix.


Figure 2. Average phi correlation between repeated ChatGPT classifications.


Figure 3. Total performance metrics.


Figure 4. Average accuracy and recall scores based on human-annotated documents (k = 5).


Figure 5. Comparison of performance between UI and API approaches. Note: The range of each crossbar corresponds to the best- and worst-performing arguments per metric.


Figure 6. Performance of the cut-off method compared with the average API. Note: The range of each crossbar corresponds to the best- and worst-performing arguments per metric.

Supplementary material

Gielens et al. supplementary material (File, 30.2 KB)