Nonrandom Tweet Mortality and Data Access Restrictions: Compromising the Replication of Sensitive Twitter Studies

Published online by Cambridge University Press:  17 May 2024

Andreas Küpfer*
Affiliation:
Institute for Political Science, Technical University of Darmstadt, Darmstadt, 64283 Hesse, Germany
*Corresponding author: Andreas Küpfer; Email: andreas.kuepfer@tu-darmstadt.de

Abstract

Used by politicians, journalists, and citizens, Twitter has for over a decade been the most important social media platform for investigating political phenomena such as hate speech, polarization, or terrorism. A high proportion of Twitter studies of emotionally charged or controversial content are difficult to replicate because their Twitter-related replication data are incomplete and their datasets cannot be recrawled in full. This paper shows that these Twitter studies and their findings are considerably affected by nonrandom tweet mortality and data access restrictions imposed by the platform. Sensitive datasets suffer a notably higher removal rate than nonsensitive datasets, and attempting to replicate key findings of Kim’s (2023, Political Science Research and Methods 11, 673–695) influential study on the content of violent tweets leads to significantly different results. The results highlight that access to complete replication data is particularly important in light of dynamically changing social media research conditions. Thus, the study raises concerns about, and discusses potential solutions to, the broader implications of nonrandom tweet mortality for future social media research on Twitter and similar platforms.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of The Society for Political Methodology

Figure 1 Ways in which political scientists share Twitter datasets in replication archives, across all 50 papers analyzing the content of tweets in seven major political science journals.

Figure 2 Availability rate of a random sample of up to 10,000 tweets from each of the 16 sensitive and nonsensitive paper datasets which shared at least their tweet IDs. Retrieving all tweets was attempted in cases where the original dataset contained fewer than 10,000 tweets. Temporão et al. (2018) shared user IDs instead of tweet IDs, as the authors crawled all users’ tweets. Thus, I checked the availability of these user accounts instead of tweets. Data was recrawled on May 17, 2023.
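
The sampling rule and the availability rate described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the author's code: the function names `sample_ids` and `availability_rate`, the fixed seed, and the assumption that the recrawl yields a plain collection of still-retrievable IDs are all hypothetical. The recrawl itself, which depends on platform access, is deliberately left out.

```python
import random


def sample_ids(tweet_ids, cap=10_000, seed=0):
    """Return every ID when there are at most `cap`, else a random sample of `cap` IDs.

    `cap` and `seed` are illustrative defaults, not values confirmed by the paper
    beyond the 10,000 limit stated in the caption.
    """
    ids = list(tweet_ids)
    if len(ids) <= cap:
        return ids
    return random.Random(seed).sample(ids, cap)


def availability_rate(sampled_ids, available_ids):
    """Share of the sampled IDs that were still retrievable at recrawl time."""
    if not sampled_ids:
        raise ValueError("no IDs were sampled")
    available = set(available_ids)
    hits = sum(1 for tweet_id in sampled_ids if tweet_id in available)
    return hits / len(sampled_ids)
```

For a dataset that shares user IDs rather than tweet IDs, as in the Temporão et al. (2018) case, the same two functions would apply to account IDs.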

Figure 3 Timeline comparison of normalized proportion of violent political rhetoric tweets during the U.S. election 2020 for both the original and recrawled datasets (replication of Kim 2023). Proportions are based on aggregated information on 215,923 original and 35,552 recrawled tweets.

Figure 4 Comparative distribution of mean mentions of accounts in tweets containing violent political rhetoric by gender, party, and position in the original and recrawled datasets. Uncertainty displays the 95% confidence interval of each group. Proportions are based on aggregated information on 215,923 original and 35,552 recrawled tweets.

Figure 5 Comparison of terms grouped by violent (teal) and nonviolent (purple) tweets (a random sample of 5,000 tweets for each group) for both the original (left plot and upper-right panel) and the recrawled datasets (bottom-right panel). The x-axis shows the overall frequency of words in the dataset. The y-axis and the size of a word represent the frequency of words within a group. Following Kim (2023), several preprocessing techniques, such as lowercasing, stopword removal, and stemming, were applied to the tweets’ content.

Figure 6 Comparison of regression model coefficients based on the original, recrawled, and resampled dataset with their 95% confidence intervals. The resampled regression model is a simulation based on rebalanced party and gender ratios following the original dataset distributions.
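
One way to read "rebalanced party and gender ratios" is as a stratified resample of the recrawled data whose group shares match those of the original data. The sketch below illustrates that reading only. The caption does not specify the procedure, so sampling with replacement, the function name `rebalance`, and the record layout (dicts with `party` and `gender` keys in the usage example) are assumptions.

```python
import random
from collections import Counter, defaultdict


def rebalance(recrawled, original, key, n=None, seed=0):
    """Resample `recrawled` so that group shares under `key` follow `original`.

    Draws with replacement within each group (an assumption, not the paper's
    documented method). Groups absent from `recrawled` cannot be restored
    and are skipped.
    """
    rng = random.Random(seed)
    n = len(recrawled) if n is None else n

    # Target group counts taken from the original dataset.
    target = Counter(key(record) for record in original)
    total = sum(target.values())

    # Pool the recrawled records by group.
    strata = defaultdict(list)
    for record in recrawled:
        strata[key(record)].append(record)

    resampled = []
    for group, count in target.items():
        pool = strata.get(group)
        if not pool:
            continue
        k = round(n * count / total)  # group size implied by the original share
        resampled.extend(rng.choices(pool, k=k))
    return resampled
```

A call such as `rebalance(recrawled, original, key=lambda r: (r["party"], r["gender"]))` would then give a dataset on which the regression could be refit. Because shares are rounded per group, the output size can differ slightly from `n`.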

Supplementary material: File

Küpfer supplementary material (File, 694.8 KB)