
Synthetic Replacements for Human Survey Data? The Perils of Large Language Models

Published online by Cambridge University Press:  17 May 2024

James Bisbee*, Joshua D. Clinton, Cassy Dorff, Brenton Kenkel, and Jennifer M. Larson
Affiliation: Political Science Department, Vanderbilt University, Nashville, TN, USA
*Corresponding author: James Bisbee; Email: james.h.bisbee@vanderbilt.edu

Abstract

Large language models (LLMs) offer new research possibilities for social scientists, but their potential as “synthetic data” is still largely unknown. In this paper, we investigate how accurately the popular LLM ChatGPT can recover public opinion, prompting the LLM to adopt different “personas” and then provide feeling thermometer scores for 11 sociopolitical groups. The average scores generated by ChatGPT correspond closely to the averages in our baseline survey, the 2016–2020 American National Election Study (ANES). Nevertheless, sampling by ChatGPT is not reliable for statistical inference: there is less variation in responses than in the real surveys, and regression coefficients often differ significantly from equivalent estimates obtained using ANES data. We also document how the distribution of synthetic responses varies with minor changes in prompt wording, and we show how the same prompt yields significantly different results over a 3-month period. Altogether, our findings raise serious concerns about the quality, reliability, and reproducibility of synthetic survey data generated by LLMs.
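The abstract describes prompting the LLM to adopt a "persona" and then report feeling thermometer scores for target groups. The paper's exact prompt wording is not reproduced on this page; as a rough illustration only, a persona-style prompt of this kind might be constructed as below. All wording, attribute names, and the response format here are assumptions, not the authors' actual prompt.

```python
# Illustrative persona-prompt construction for an LLM survey "respondent".
# The specific phrasing is hypothetical; the paper's real prompts are in
# its supplementary material.
def persona_prompt(age, gender, race, party, ideology, target_group):
    persona = (
        f"You are a {age}-year-old {race} {gender} who identifies as a "
        f"{party} and describes your ideology as {ideology}."
    )
    task = (
        f"On a feeling thermometer from 0 (very cold) to 100 (very warm), "
        f"how do you feel toward {target_group}? Reply with a number only."
    )
    return persona + " " + task

print(persona_prompt(45, "woman", "white", "Democrat", "liberal", "labor unions"))
```

In a full pipeline, the returned string would be sent to the LLM's chat API for each sampled persona, and the numeric reply parsed into a synthetic thermometer score.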

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of The Society for Political Methodology

Figure 1 Average feeling thermometer results (x-axis) for different target groups (y-axis) by prompt type/timing (columns). Average ANES estimates from the 2016 and 2020 waves indicated with red triangles and one standard deviation indicated with thick red bars. LLM-derived averages indicated with black circles and thin black bars. Sample sizes for each groupwise comparison are identical.


Figure 2 Average feeling thermometer results (x-axis) for different target groups (y-axes) by party ID of respondent (columns). Average ANES estimates from the 2016 and 2020 waves indicated with red triangles and one standard deviation indicated with thick red bars. LLM-derived averages indicated by black circles and thin black bars. Sample sizes for each groupwise comparison are identical.


Table 1 Calculations of the sample size necessary to reject, with a specified power, the null hypothesis of no difference in affective polarization among partisans from the average level in the 2012 ANES, at the 5% significance level. The second column reports the calculation assuming an effect size and standard deviation equal to the 2016–2020 pooled ANES values (effect size 7.8, SD 31.4); the third column repeats the calculation with our ChatGPT estimates (effect size 12.5, SD 16.1).
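Table 1 does not show the formula behind its sample size calculations. A standard normal-approximation calculation for a two-sided one-sample test, using the effect sizes and standard deviations quoted in the caption, can be sketched as follows. The choice of formula is an assumption; the paper may use a different procedure, so the resulting numbers need not match Table 1 exactly.

```python
import math
from statistics import NormalDist

def one_sample_n(effect, sd, power=0.80, alpha=0.05):
    """Sample size for a two-sided one-sample z-test (normal approximation):
    n = ((z_{1-alpha/2} + z_{power}) * sd / effect)^2, rounded up."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sd / effect) ** 2)

# Effect sizes and SDs reported in the Table 1 caption
n_anes = one_sample_n(7.8, 31.4)   # pooled 2016-2020 ANES values
n_gpt = one_sample_n(12.5, 16.1)   # ChatGPT synthetic values
```

The larger effect and smaller variance in the synthetic data imply a much smaller required sample, which is exactly the pattern the paper flags as misleading: the compressed variance of LLM responses makes power calculations look deceptively favorable.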


Figure 3 Each point is a coefficient estimate capturing the partial correlation between a covariate and a feeling thermometer score toward one of the target groups, estimated in either 2016 or 2020. The x-axis position is the coefficient estimated in the ANES data; the y-axis position is the same coefficient estimated in the synthetic data. Solid points indicate coefficients that differ significantly between the ANES and synthetic estimates, while hollow points indicate coefficients that do not. Points in the northeast and southwest quadrants yield the same substantive interpretation, while those in the northwest and southeast quadrants yield differing interpretations. A synthetic dataset that perfectly recovered the relationships estimated in the ANES data would have all points falling along the 45-degree line.


Figure 4 Mean absolute error (MAE; x-axes) associated with different target groups (y-axes) by partisanship (columns) for different prompts used to generate the synthetic data. MAE is calculated as the absolute difference between a human respondent’s feeling thermometer score for a given target group in the ANES data and the average score of 30 synthetic respondents who match that human respondent in terms of demographics only (light gray circles), political attributes only (dark gray triangles), or both demographic and political attributes combined (black squares).
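The MAE described in the Figure 4 caption can be computed with a short helper. This sketch assumes the scores are stored as parallel lists, one human score and one list of 30 matched synthetic draws per respondent; that data layout is an assumption for illustration.

```python
from statistics import mean

def mae(human_scores, synthetic_draws):
    """Mean absolute error as in Figure 4: for each human respondent,
    take |human score - mean of their matched synthetic scores|,
    then average over respondents."""
    errors = [abs(h - mean(draws))
              for h, draws in zip(human_scores, synthetic_draws)]
    return mean(errors)
```

For example, if the synthetic averages exactly match the human scores, the MAE is zero; any systematic compression of the synthetic responses toward a common value inflates it.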


Figure 5 Reproducibility of synthetic data over time. Both panels compare the synthetic dataset generated by a simple prompt in April 2023 to the identical prompt rerun in June (left facet) and July (right facet) of the same year. Each point indicates the number of observations in the April versus the later synthetic dataset, aggregating across respondents and target groups. The linear regression equation, shown in the top left of each facet, reveals substantial attenuation between the April and July runs of the same prompt.

Supplementary material: Bisbee et al. supplementary material (File, 13.9 MB)