
Improving short text classification with augmented data using GPT-3

Published online by Cambridge University Press:  25 August 2023

Salvador V. Balkus*
Affiliation:
Program in Data Science, University of Massachusetts Dartmouth, Dartmouth, MA, USA
Donghui Yan
Affiliation:
Department of Mathematics, University of Massachusetts Dartmouth, Dartmouth, MA, USA
Corresponding author: Salvador V. Balkus; Email: sbalkus@g.harvard.edu

Abstract

GPT-3 is a large-scale natural language model developed by OpenAI that can perform many different tasks, including topic classification. Although researchers claim that it requires only a small number of in-context examples to learn a task, in practice GPT-3 requires these training examples to be either of exceptional quality or in larger quantities than can easily be created by hand. To address this issue, this study teaches GPT-3 to classify whether a question is related to data science by augmenting a small training set with additional examples generated by GPT-3 itself. This study compares two augmented classifiers: the Classification Endpoint with an increased training set size and the Completion Endpoint with an augmented prompt optimized using a genetic algorithm. We find that data augmentation significantly increases the accuracy of both classifiers, and that the embedding-based Classification Endpoint achieves the best accuracy, about 76%, compared to human accuracy of 85%. In this way, giving large language models like GPT-3 the ability to propose their own training examples can improve short text classification performance.

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. Our Classification Endpoint augmentation. First, new artificial questions are generated using the GPT-3 Completion Endpoint's ability to produce text based on existing examples. The newly generated questions are then used to train the Classification Endpoint so that, given an input, it produces a more accurate output.


Figure 2. Example of an input to the GPT-3 Completion Endpoint interface. By organizing text in the question-topic-question-topic pattern, GPT-3 can be instructed to output labels classifying the topic of a question. The output of the model is highlighted.
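The question-topic pattern described in Figure 2 can be assembled programmatically. The sketch below is illustrative only (the function name, labels, and example questions are our own, not the authors' code); it builds a few-shot prompt whose final "Topic:" line the model is expected to complete with a label.

```python
def build_prompt(examples, new_question):
    """Assemble a few-shot prompt in the alternating
    question-topic pattern of Figure 2. `examples` is a list of
    (question, topic) pairs; the trailing "Topic:" line is left
    blank for the model to complete with a label."""
    lines = []
    for question, topic in examples:
        lines.append(f"Question: {question}")
        lines.append(f"Topic: {topic}")
    lines.append(f"Question: {new_question}")
    lines.append("Topic:")
    return "\n".join(lines)

# Illustrative examples; the study's actual questions are not reproduced here.
prompt = build_prompt(
    [("How do I tune a random forest?", "Data Science"),
     ("What time does the game start?", "Other")],
    "Which loss function should I use for imbalanced classes?",
)
```

Sending this string to the Completion Endpoint with a short maximum output length causes the model to emit only the topic label for the final question.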


Figure 3. Graphical depiction of the genetic algorithm for selecting optimal augmented in-context examples for the GPT-3 Completion Endpoint. Each candidate consists of a set of alleles representing questions provided to the Completion Endpoint prompt. At each generation, the candidates with the best accuracy are selected to produce offspring with new candidates containing augmented examples generated by GPT-3.


Figure 4. Steps in the genetic algorithm for Completion Endpoint augmentation.
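The loop in Figures 3 and 4 can be sketched as a generic genetic algorithm over candidate sets of in-context examples. Everything below is an illustrative stand-in, not the authors' implementation: `fitness` plays the role of validation-set accuracy, and the `pool` of integers plays the role of GPT-3-generated questions available as fresh alleles.

```python
import random

def genetic_search(pool, fitness, n_alleles=4, pop_size=6,
                   generations=10, seed=0):
    """Toy genetic algorithm mirroring Figure 4: evaluate each
    candidate (a set of alleles), select the fittest half as
    parents, and breed offspring that keep some parent alleles and
    draw the rest fresh from `pool`."""
    rng = random.Random(seed)
    population = [rng.sample(pool, n_alleles) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        offspring = []
        for parent in parents:
            child = rng.sample(parent, n_alleles // 2)
            while len(child) < n_alleles:
                new = rng.choice(pool)
                if new not in child:
                    child.append(new)
            offspring.append(child)
        population = parents + offspring
    return max(population, key=fitness)

# Toy run: fitness rewards candidates whose alleles sum to a small value.
pool = list(range(20))
best = genetic_search(pool, fitness=lambda cand: -sum(cand))
```

In the study itself, evaluating `fitness` requires one Completion Endpoint call per validation question, so the population size and generation count trade off accuracy against API cost.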


Table 1. Genetic algorithm parameters for Completion Endpoint in-context example selection optimization


Table 2. GPT-3 Classification Endpoint performance on data science question topic classification, with additional examples generated using GPT-3 Davinci Completion. $p$-values test for a significant difference from the results with 0 additional examples, using a permutation test for difference in means
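The significance test named in Table 2's caption can be written in a few lines of plain Python. This is a generic two-sided permutation test for a difference in means, not the authors' exact code, and the accuracy values below are illustrative only.

```python
import random

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means:
    pool the samples, repeatedly shuffle and re-split them, and
    count how often the shuffled difference in means is at least
    as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[: len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed - 1e-12:  # tolerance for float round-off
            extreme += 1
    return extreme / n_perm

# Hypothetical accuracies from runs with 0 vs. some added examples.
p = permutation_test([0.70, 0.72, 0.71], [0.75, 0.77, 0.76])
```

Because the test conditions only on the pooled observations, it makes no normality assumption, which suits the small number of trials per configuration.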


Figure 5. GPT-3 Classification Endpoint mean performance with standard errors on data science question topic classification. Training data are augmented by adding different quantities of new examples generated with GPT-3 Davinci Completion.


Figure 6. Genetic algorithm performance for selecting best in-context examples for the GPT-3 Completion Endpoint. Results from each individual trial are compared to the baseline, which represents the expected performance of random guessing, in terms of classification accuracy on the 26-question validation set.


Figure 7. Averaged genetic algorithm performance for selecting best in-context examples for the GPT-3 Completion Endpoint with standard error across three trials. Performance is measured in terms of classification accuracy on the 26-question validation set.


Figure 8. Average proportion of the original population of alleles, candidates, and alleles from the best candidate that remain in the population at each subsequent generation, across three trials. This plot shows the average replacement rate of each over time.


Table 3. Comparison of augmented GPT-3 Completion Endpoint performance on data science question classification, in-context examples selected using genetic algorithm ($n_{\text{trials}}=3$), to augmented Classification Endpoint


Table 4. Proportion of questions for which a given fraction of post hoc reviewers agreed with the original label. Human accuracy is reported as the fraction of post hoc labels that matched the original label