
Improving Probabilistic Models In Text Classification Via Active Learning

Published online by Cambridge University Press:  05 August 2024

MITCHELL BOSLEY*
Affiliation:
University of Michigan, United States, and University of Toronto, Canada
SAKI KUZUSHIMA*
Affiliation:
University of Michigan, and Harvard University, United States
TED ENAMORADO*
Affiliation:
Washington University in St. Louis, United States
YUKI SHIRAITO*
Affiliation:
University of Michigan, United States
Mitchell Bosley, Ph.D. Candidate, Department of Political Science, University of Michigan, United States; Postdoctoral Fellow, Munk School of Global Affairs and Public Policy, and Schwartz Reissman Institute for Technology and Society, University of Toronto, Canada, mcbosley@umich.edu
Saki Kuzushima, Ph.D. Candidate, Department of Political Science, University of Michigan, United States; Postdoctoral Fellow, Program on US-Japan Relations, Harvard University, United States, skuzushi@umich.edu
Corresponding author: Ted Enamorado, Assistant Professor, Department of Political Science, Washington University in St. Louis, United States, ted@wustl.edu
Yuki Shiraito, Assistant Professor, Department of Political Science, University of Michigan, United States, shiraito@umich.edu

Abstract

Social scientists often classify text documents to use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool because it requires less human coding. However, scholars still need many human-labeled documents to train a classifier. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model exploits both labeled and unlabeled data, and active learning concentrates labeling efforts on documents that are difficult to classify. Our validation study shows that with a small number of labeled documents, the classification performance of our algorithm is comparable to that of state-of-the-art methods at a fraction of the computational cost. We replicate the results of two published articles with only a small fraction of the labeled data used in those studies and provide open-source software to implement our method.
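The abstract describes two components: a probabilistic model that learns from both labeled and unlabeled documents, and an active-learning step that targets hard-to-classify documents. A minimal sketch of the first component is below, assuming a two-class multinomial mixture fit by EM over document-term counts; the model, function names, and hyperparameters are illustrative and not the authors' implementation.

```python
import numpy as np

def em_mixture(X_lab, y_lab, X_unl, n_iter=50, alpha=0.01):
    """Fit a two-class multinomial mixture by EM on labeled and
    unlabeled document-term count matrices. Labeled documents keep
    their class fixed (one-hot weights); unlabeled documents enter
    through soft responsibilities updated each E-step."""
    hard = np.eye(2)[y_lab]  # one-hot class weights for labeled docs
    # Initialize word probabilities from labeled counts (alpha smoothing).
    theta = np.vstack([X_lab[y_lab == k].sum(axis=0) for k in (0, 1)]) + alpha
    theta /= theta.sum(axis=1, keepdims=True)
    pi = np.full(2, 0.5)  # class priors
    for _ in range(n_iter):
        # E-step: posterior class probabilities for unlabeled docs
        # (computed in log space for numerical stability).
        logp = np.log(pi) + X_unl @ np.log(theta).T
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors and word probabilities from
        # all documents, mixing hard labels and soft responsibilities.
        w = np.vstack([hard, resp])
        pi = w.mean(axis=0)
        theta = w.T @ np.vstack([X_lab, X_unl]) + alpha
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, resp

# Tiny illustration: two labeled docs, two unlabeled docs, vocabulary
# of two terms. The unlabeled docs' responsibilities track whichever
# class dominates their term counts.
X_lab = np.array([[5.0, 0.0], [0.0, 5.0]])
y_lab = np.array([0, 1])
X_unl = np.array([[4.0, 1.0], [1.0, 4.0]])
pi, theta, resp = em_mixture(X_lab, y_lab, X_unl)
```

The point of the sketch is only that unlabeled documents contribute to the M-step through their responsibilities, so the word-probability estimates borrow strength from the full corpus rather than the labeled subset alone.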

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of American Political Science Association

Figure 1. Labeling: Passive vs. Active Learning
Note: Panel a presents a corpus in which a classifier based on the term frequencies of “Spending” and “Gridlock” is used to categorize unlabeled (U) documents as political (P) or nonpolitical (N). Panel b depicts passive learning, where the next document to be labeled is selected at random. In contrast, panel c demonstrates active learning, where labeling the U documents in the classifier’s region of uncertainty (shaded) is prioritized, because their true labels are more informative for learning the decision boundary between P and N.
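The selection rule in panel c can be sketched in a few lines: given a classifier's predicted probabilities on the unlabeled pool, active learning queries the document nearest the decision boundary, while passive learning samples uniformly. This is a generic uncertainty-sampling sketch under assumed inputs, not the article's exact acquisition function.

```python
import numpy as np

def next_to_label(probs, active=True, rng=None):
    """Pick the index of the next unlabeled document to send to a
    human coder. `probs` holds each document's predicted probability
    of being political (P). Active learning targets the document
    whose probability is closest to 0.5 (the uncertainty region in
    Figure 1); passive learning draws uniformly at random."""
    probs = np.asarray(probs, dtype=float)
    if active:
        return int(np.argmin(np.abs(probs - 0.5)))
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.integers(len(probs)))

# Three unlabeled documents: two are confidently classified
# (0.95 and 0.08), one sits near the boundary (0.52).
queried = next_to_label([0.95, 0.52, 0.08])  # active pick: index 1
```

With probabilities [0.95, 0.52, 0.08], the active rule queries the middle document, since labeling a confidently classified document would barely move the decision boundary.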


Figure 2. Comparison of Classification Results with Random and Active Versions of activeText and DistilBERT


Figure 3. Comparison of Classification and Time Results across activeText and DistilBERT


Figure 4. Classification Results of activeText with and without Keywords


Table 1. Classification Performance: Comparison with Gohdes (2020) Results


Figure 5. Replication of Figure 3 in Gohdes (2020): Expected Proportion of Targeted Killings, Given Internet Accessibility and Whether a Region Is Inhabited by the Alawi Minority
Note: The results from activeText are presented in the left panel and those of Gohdes (2020) are on the right.


Figure 6. Replication of Figure 1 in Park, Greene, and Colaresi (2020): The Relationship between Information Density and Average Sentiment Score


Figure 7. Bias of the Empirical Risk for Labeled Data (Left Panel) and Out-of-Sample Classification Performance (Right Panel) of activeText and activeText+LURE
Note: For each panel, the x-axis represents the number of documents labeled and the y-axis represents the average bias and average out-of-sample F1 score across one hundred Monte Carlo simulations. Shaded areas represent the 95% confidence intervals across Monte Carlo simulations.
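The activeText+LURE variant in Figure 7 refers to a correction for the bias that active sampling introduces into empirical risk estimates: points chosen for their difficulty are not a random sample, so their average loss misrepresents the population risk. As a rough illustration of the kind of reweighting involved (following the LURE estimator of Farquhar, Gal, and Rainforth 2021; the exact form used in the article may differ), each acquired point's loss is weighted by how its acquisition probability departs from uniform sampling:

```python
def lure_weight(m, M, N, q_m):
    """Weight for the m-th actively acquired point (1-indexed) out of
    M total acquisitions from a pool of N, where q_m is the
    probability with which the acquisition rule chose that point at
    step m. Weighted losses v_m * L_m average to a less biased risk
    estimate. When acquisition is uniform (q_m = 1/(N - m + 1)), the
    correction term vanishes and the weight is exactly 1."""
    return 1.0 + (N - M) / (N - m) * (1.0 / ((N - m + 1) * q_m) - 1.0)

# Uniform acquisition at step 1 from a pool of 100 -> weight 1.
w_uniform = lure_weight(1, 5, 100, 1 / 100)
# A point picked with twice-uniform probability is down-weighted
# relative to one picked with half-uniform probability.
w_rare = lure_weight(1, 5, 100, 1 / 200)
```

Intuitively, points the acquisition rule was eager to pick (high q_m) are over-represented in the labeled set, so their losses get weights below 1, and rarely picked points get weights above 1.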

Supplementary material
Bosley et al. supplementary material (File, 981.5 KB)