
Selecting More Informative Training Sets with Fewer Observations

Published online by Cambridge University Press:  08 June 2023

Aaron R. Kaufman*
Affiliation: Division of Social Science, New York University Abu Dhabi, UAE. E-mail: aaronkaufman@nyu.edu
*Corresponding author: Aaron R. Kaufman

Abstract

A standard text-as-data workflow in the social sciences involves identifying a set of documents to be labeled, selecting a random sample of them to label using research assistants, training a supervised learner to label the remaining documents, and validating that model’s performance using standard accuracy metrics. The most resource-intensive component of this is the hand-labeling: carefully reading documents, training research assistants, and paying human coders to label documents in duplicate or more. We show that hand-coding an algorithmically selected rather than a simple-random sample can improve model performance above baseline by as much as 50%, or reduce hand-coding costs by up to two-thirds, in applications predicting (1) U.S. executive-order significance and (2) financial sentiment on social media. We accompany this manuscript with open-source software to implement these tools, which we hope can make supervised learning cheaper and more accessible to researchers.
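
As an illustration only, the baseline workflow described above (draw a simple random sample, hand-label it, train a supervised learner, validate, then label the remaining documents) might be sketched as follows. This is a minimal sketch under stated assumptions and is not the article's accompanying software or its selection algorithm: the toy corpus, the `hand_label` stand-in for research assistants, the function names, and the choice of a small Naive Bayes classifier are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Fit a multinomial Naive Bayes model (hypothetical stand-in learner)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, doc):
    """Return the most probable label, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in doc.lower().split():
            if token in vocab:  # ignore tokens never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def random_sample_workflow(corpus, hand_label, n_label, seed=0):
    """Baseline: label a simple random sample, train, validate, predict the rest."""
    rng = random.Random(seed)
    idx = list(range(len(corpus)))
    rng.shuffle(idx)
    labeled_idx = idx[:n_label]                      # simple random sample
    labels = [hand_label(corpus[i]) for i in labeled_idx]

    split = int(0.8 * n_label)                       # 80% train, 20% validation
    model = train_naive_bayes([corpus[i] for i in labeled_idx[:split]],
                              labels[:split])

    valid_pairs = zip(labeled_idx[split:], labels[split:])
    correct = sum(predict(model, corpus[i]) == y for i, y in valid_pairs)
    accuracy = correct / (n_label - split)

    # Machine-label every document that was not hand-labeled.
    predictions = {i: predict(model, corpus[i]) for i in idx[n_label:]}
    return accuracy, predictions


# Toy data (hypothetical): alternating positive and negative "posts".
corpus = [f"buy rally gain doc{i}" if i % 2 == 0 else f"sell drop loss doc{i}"
          for i in range(200)]


def hand_label(doc):
    """Hypothetical stand-in for a human coder."""
    return "positive" if "buy" in doc.split() else "negative"


accuracy, predictions = random_sample_workflow(corpus, hand_label, n_label=40)
```

In this sketch, `n_label` is the quantity that drives hand-coding cost. The abstract's claim is that replacing the random draw in `random_sample_workflow` with an algorithmically selected sample can reach comparable accuracy with fewer labeled documents; how that selection is made is not specified in this excerpt and is deliberately left out here.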

Information

Type
Letter
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of the Society for Political Methodology

Figure 1. The top two methods substantially outperform both the random sampling and the Taddy (2013) approaches, achieving both accuracy gains and cost reductions. The improvement over random sampling is similarly strong for the Executive Orders as for the StockTwits application. Note: the x-axis location of each point is jittered for readability.

Supplementary material

Kaufman supplementary material (File, 318.4 KB)
Kaufman_Dataset (Dataset)