
StudyTypeTeller—Large language models to automatically classify research study types for systematic reviews

Published online by Cambridge University Press: 11 September 2025

Simona Emilova Doneva*
Affiliation:
Center for Reproducible Science, University of Zurich, Zurich, Switzerland
Shirin de Viragh
Affiliation:
Center for Reproducible Science, University of Zurich, Zurich, Switzerland
Hanna Hubarava
Affiliation:
Center for Reproducible Science, University of Zurich, Zurich, Switzerland
Stefan Schandelmaier
Affiliation:
CLEAR Methods Center, Division of Clinical Epidemiology, Department of Clinical Research, University Hospital Basel, University of Basel, Basel, Switzerland; MTA–PTE Lendület “Momentum” Evidence in Medicine Research Group, Medical School, University of Pécs, Pécs, Hungary
Matthias Briel
Affiliation:
CLEAR Methods Center, Division of Clinical Epidemiology, Department of Clinical Research, University Hospital Basel, University of Basel, Basel, Switzerland; Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada
Benjamin Victor Ineichen
Affiliation:
Center for Reproducible Science, University of Zurich, Zurich, Switzerland; Department of Clinical Research, University of Bern, Bern, Switzerland
*
Corresponding author: Simona Emilova Doneva; Email: simona.doneva@uzh.ch

Abstract

Abstract screening, a labor-intensive step of systematic reviews, is becoming increasingly challenging due to the rising volume of scientific publications. Recent advances suggest that generative large language models such as the generative pre-trained transformer (GPT) could aid this process by classifying references into study types such as randomized controlled trials (RCTs) or animal studies prior to abstract screening. However, it is unknown how well these GPT models classify such scientific study types in the biomedical field. Additionally, their performance has not been directly compared with earlier transformer-based models such as bidirectional encoder representations from transformers (BERT). To address this, we developed a human-annotated corpus of 2,645 PubMed titles and abstracts, annotated for 14 study types, including different types of RCTs and animal studies, systematic reviews, study protocols, case reports, and in vitro studies. Using this corpus, we compared the performance of GPT-3.5 and GPT-4 in automatically classifying these study types against established BERT models. Our results show that fine-tuned pretrained BERT models consistently outperformed the GPT models, achieving F1-scores above 0.8, compared to approximately 0.6 for the GPT models. Advanced prompting strategies did not substantially boost GPT performance. In conclusion, these findings highlight that, even though GPT models benefit from advanced capabilities and extensive training data, their performance on niche tasks such as multi-class classification of scientific study types is inferior to that of smaller fine-tuned models. Nevertheless, automated methods remain promising for reducing the volume of records and making the screening of large reference libraries more feasible. Our corpus is openly available and can be used to develop and evaluate other natural language processing (NLP) approaches.
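As a rough illustration of the BERT-based baseline described above, the sketch below fine-tunes SciBERT for multi-class study-type classification of titles and abstracts using Hugging Face Transformers and reports a macro-averaged F1-score. It is a minimal sketch under assumed settings (checkpoint, hyperparameters, toy records, label ids), not the authors' exact pipeline.

```python
# Minimal illustrative sketch (assumptions: checkpoint, hyperparameters, toy data);
# not the authors' exact pipeline.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "allenai/scibert_scivocab_uncased"
NUM_CLASSES = 14  # e.g., RCT subtypes, animal studies, systematic reviews, ...

# Toy records standing in for the annotated corpus (title + abstract, label id).
train_records = [
    {"text": "A randomized controlled trial of drug X in adults ...", "label": 0},
    {"text": "Effects of compound Y in a mouse model of stroke ...", "label": 1},
]
val_records = list(train_records)  # placeholder validation split

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_CLASSES
)

def tokenize(batch):
    # Truncate the concatenated title + abstract to the model's maximum length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = Dataset.from_list(train_records).map(tokenize, batched=True)
val_ds = Dataset.from_list(val_records).map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Macro-averaged F1 weights all study types equally, regardless of class size.
    return {"macro_f1": f1_score(labels, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="scibert-study-types",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        evaluation_strategy="epoch",
    ),
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```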

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Table 1 Annotated corpus: Key statistics of the dataset after stratified splitting into train, validation, and test sets

Table 2 Overview of employed prompting strategies

Table 3 Performance metrics for GPT-3.5-turbo and GPT-4-turbo-preview (for the selected prompting strategy) and for the BERT models (multi-class classification)

Figure 1 Abstract length per class, calculated on the whole dataset before splitting into train, validation, and test sets.

Figure 2 Per-class performance comparison between the best-performing prompting strategy for GPT-3.5 and GPT-4 (P2_H_b3, CC) and SciBERT, with 95% confidence intervals.

Figure 3 Confusion matrices of (a) the best-performing prompting strategy for the GPT-based models (GPT-4, P2_H_b3, CC) and (b) the best-performing BERT-based model (SciBERT).

Figure 4 Overview of the top ten predicted labels based on the GoldHamster or Multi-Tagger corpus for (a, b) the full dataset, (c) the abstracts annotated as Remaining in our corpus, and (d) the abstracts annotated as Remaining in our corpus and as human by GoldHamster, highlighted in orange in (c). Subfigure (e) shows the top ten Multi-Tagger labels containing Randomized Controlled Trials (RCT), and (f) the corresponding labels assigned to this set of articles in our dataset.

Table 4 Top performing models and strategies among the GPT and BERT models, evaluated in the multi-class classification task

Supplementary material

Emilova Doneva et al. supplementary material (File, 1.1 MB)