
Synthetically generated text for supervised text analysis

Published online by Cambridge University Press:  24 January 2025

Andrew Halterman*
Affiliation:
Michigan State University, East Lansing, MI, USA

Abstract

Large language models (LLMs) are a powerful tool for conducting text analysis in political science, but using them to annotate text has several drawbacks, including high cost, limited reproducibility, and poor explainability. Traditional supervised text classifiers are fast and reproducible, but require expensive hand annotation, which is especially difficult for rare classes. This article proposes using LLMs to generate synthetic training data for training smaller, traditional supervised text models. Synthetic data can augment limited hand-annotated data or be used on its own to train a classifier with good performance and greatly reduced cost. I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, a simple technique for improving the quality of synthetic text, and an illustration of its limitations. I demonstrate the usefulness of synthetic training data through three validations: synthetic news articles describing police responses to communal violence in India for training an event detection system, a multilingual corpus of synthetic populist manifesto statements for training a sentence-level populism classifier, and synthetic tweets describing the fighting in Ukraine for improving a named entity recognition system.

Information

Type: Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Political Methodology

Figure 1 Overview of options for controlling synthetic text generation. Researchers can affect the content and style of synthetic documents by changing language model parameters ($\theta $), by providing new prompts ($w_{i-1}, w_{i-2},...$), or by changing the sampling parameters ($\gamma $). Researchers then decide how to use the synthetic text as training data.
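Figure 1's third lever, the sampling parameters ($\gamma $), is the cheapest to adjust: it changes how the next token is drawn from the model's output distribution without touching the model or the prompt. As a hedged illustration (not the paper's implementation), the sketch below shows how two common sampling parameters, temperature and nucleus (top-p) sampling, reshape a next-token distribution:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Draw one token index from raw logits.

    temperature < 1 sharpens the distribution (more repetitive text);
    temperature > 1 flattens it (more diverse, noisier text).
    top_p < 1 restricts sampling to the smallest set of tokens whose
    cumulative probability reaches top_p (nucleus sampling).
    """
    rng = rng or np.random.default_rng()
    # Temperature rescales the logits before the softmax.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # stable softmax
    probs /= probs.sum()
    # Nucleus sampling: keep the top tokens until their mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()  # renormalize over the nucleus
    return int(rng.choice(keep, p=kept))
```

For example, with logits `[0.0, 1.0, 5.0]`, a near-zero temperature makes the sampler almost always return token 2, while a high temperature makes all three tokens plausible; the function names and interface here are illustrative, not drawn from the article.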


Table 1 Overview of the three approaches to controlling synthetic text generation.


Table 2 Overview of the validations.


Figure 2 Mean F1 performance on an evaluation set for three classes (ARREST, FAIL, KILL) with increasing sizes of the hand-annotated training set (0, 100, 500, 1,000). The red bar shows performance with hand-labeled data only. Orange shows performance with traditional data augmentation (Wei and Zou 2019). The four blue columns show the performance with different numbers of synthetic training examples added to the hand-labeled data. Maximum human annotation performance is shown as a horizontal line. Error bars show the empirical 90% range across 50 samples.


Table 3 Prompts used to generate populist text with language and country placeholders.
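Table 3's prompts parameterize the generation request by language and country so one template yields a multilingual corpus. The article's exact wording is not reproduced here; the template string below is a hypothetical stand-in that only illustrates the placeholder mechanism:

```python
# Hypothetical template in the spirit of Table 3; {language} and {country}
# are the placeholders, and the surrounding wording is illustrative only.
TEMPLATE = (
    "Write a sentence in {language} that a populist politician "
    "in {country} might include in a party manifesto."
)

def build_prompts(pairs):
    """Fill the placeholders for each (language, country) pair."""
    return [TEMPLATE.format(language=lang, country=country) for lang, country in pairs]

prompts = build_prompts([("German", "Germany"), ("Danish", "Denmark")])
```

Each filled prompt is then sent to the language model once (or several times, with sampling enabled) to produce labeled synthetic sentences per country.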


Table 4 Performance of classifiers trained on synthetic documents and real labeled documents (1,136 of each) and evaluated on real Manifesto Project text with gold-standard labels.


Figure 3 Test set performance of a named entity recognition model detecting a weapon class, trained on annotated actual tweets and annotated synthetic tweets.

Supplementary material: File

Halterman supplementary material (File, 2.2 MB)