
Compact large language models for title and abstract screening in systematic reviews: An assessment of feasibility, accuracy, and workload reduction

Published online by Cambridge University Press:  13 November 2025

Antonio Sciurti
Affiliation:
Department of Public Health and Infectious Diseases, University of Rome La Sapienza, Italy
Giuseppe Migliara
Affiliation:
Department of Life Sciences, Health, and Health Professions, Link Campus University, Italy
Leonardo Maria Siena*
Affiliation:
Department of Public Health and Infectious Diseases, University of Rome La Sapienza, Italy
Claudia Isonne
Affiliation:
Department of Public Health and Infectious Diseases, University of Rome La Sapienza, Italy
Maria Roberta De Blasiis
Affiliation:
Department of Public Health and Infectious Diseases, University of Rome La Sapienza, Italy
Alessandra Sinopoli
Affiliation:
Department of Prevention, Local Health Authority Rome 1, Italy
Jessica Iera
Affiliation:
Department of Public Health and Infectious Diseases, University of Rome La Sapienza, Italy Department of Infectious Diseases, Istituto Superiore di Sanità, Italy
Carolina Marzuillo
Affiliation:
Department of Public Health and Infectious Diseases, University of Rome La Sapienza, Italy
Corrado De Vito
Affiliation:
Department of Public Health and Infectious Diseases, University of Rome La Sapienza, Italy
Paolo Villari
Affiliation:
Department of Public Health and Infectious Diseases, University of Rome La Sapienza, Italy
Valentina Baccolini
Affiliation:
Department of Public Health and Infectious Diseases, University of Rome La Sapienza, Italy
*
Corresponding author: Leonardo Maria Siena; Email: leonardo.siena@uniroma1.it

Abstract

Systematic reviews play a critical role in evidence-based research but are labor-intensive, especially during title and abstract screening. Compact large language models (LLMs) offer the potential to automate this process, balancing time and cost requirements against accuracy. The aim of this study is to assess the feasibility, accuracy, and workload reduction achieved by three compact LLMs (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) in screening titles and abstracts. Records were sourced from three previously published systematic reviews, and the LLMs were asked to rate each record from 0 to 100 for inclusion, using a structured prompt. Predefined rating thresholds of 25, 50, and 75 were used to compute performance metrics (balanced accuracy, sensitivity, specificity, positive and negative predictive value, and workload saving). Processing time and costs were recorded. Across the systematic reviews, the LLMs achieved high sensitivity (up to 100%) but low precision (below 10%) for records included at full text. Specificity and workload savings improved at higher thresholds, with the 50- and 75-rating thresholds offering the best trade-offs. GPT-4o mini, accessed via application programming interface, was the fastest model (at most ~40 minutes per review) and incurred usage costs of $0.14–$1.93 per review. Llama 3.1 8B and Gemma 2 9B ran locally, took longer (up to ~4 hours), and were free to use. The LLMs were highly sensitive tools for the title/abstract screening process. High specificity values were reached, allowing for substantial workload savings at reasonable cost and processing time. Conversely, we found them to be imprecise. However, high sensitivity and workload reduction are the key factors for their use in the title/abstract screening phase of systematic reviews.
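The performance metrics named in the abstract all derive from a standard confusion matrix in which "positive" means a record the human reviewers included. A minimal sketch (not the authors' code; the counts in the example are invented for illustration):

```python
# Screening metrics from a confusion matrix.
# tp: LLM and reviewers include; fp: LLM includes, reviewers exclude;
# tn: both exclude; fn: LLM excludes a record the reviewers included.
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    sensitivity = tp / (tp + fn)          # share of included records caught
    specificity = tn / (tn + fp)          # share of excluded records rejected
    ppv = tp / (tp + fp)                  # precision of LLM inclusions
    npv = tn / (tn + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    # Workload saving: fraction of records the LLM screens out,
    # which human reviewers would not need to read.
    workload_saving = (tn + fn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "balanced_accuracy": balanced_accuracy,
        "workload_saving": workload_saving,
    }

# Hypothetical counts: perfect sensitivity, many false positives,
# mirroring the high-sensitivity/low-precision pattern reported.
print(screening_metrics(tp=45, fp=500, tn=2400, fn=0))
```

With zero false negatives, sensitivity is 100% even though precision (PPV) stays below 10%, which is exactly the trade-off the abstract describes.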

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Figure 1 Visual example of the inclusion decision process for a single record within each systematic review. (1) The record’s title and abstract are embedded into a structured prompt; (2) the prompt is fed into each of the three LLMs; (3) each LLM rates the record with an integer from 0 to 100, according to the prompt; (4) if the rating meets or exceeds the threshold, the record is included (individual LLM decision, ✓: included, ×: excluded); (5) individual LLM decisions are combined through majority voting; (6) individual LLM decisions and the majority vote are compared with the reviewers’ decision (TP: true positive; TN: true negative; FP: false positive; FN: false negative) for performance assessment.
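The thresholding and majority-voting steps described in Figure 1 can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the model names and ratings are placeholders:

```python
# Individual decision: include a record when the LLM's 0-100 rating
# meets or exceeds the chosen threshold (25, 50, or 75 in the study).
def individual_decision(rating: int, threshold: int) -> bool:
    return rating >= threshold

# Majority vote across the three models: include when at least
# two of the three individual decisions are "include".
def majority_vote(ratings: dict, threshold: int) -> bool:
    votes = [individual_decision(r, threshold) for r in ratings.values()]
    return sum(votes) >= 2

ratings = {"gpt-4o-mini": 80, "llama-3.1-8b": 60, "gemma-2-9b": 30}
print(majority_vote(ratings, threshold=50))  # two of three ratings >= 50 -> True
print(majority_vote(ratings, threshold=75))  # only one rating >= 75 -> False
```

Raising the threshold makes each model (and therefore the majority vote) stricter, which is why specificity and workload saving improve at the 50- and 75-rating thresholds.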


Table 1 Characteristics of systematic reviews


Table 2 LLM performance metrics, expressed as percentage (%), by systematic review


Figure 2 LLM rating ROC curves, by systematic review. (a) VL; (b) AB; (c) COVID-19. Note: LLM: large language model; VL: vaccine literacy; AB: Acinetobacter baumannii; COVID-19: coronavirus disease 2019; AUC: area under the curve; GPT: generative pretrained transformer.


Table 3 LLM performance metrics, expressed as percentage (%), by systematic review


Table 4 Characteristics of LLMs, invalid responses, responses per minute, overall time to responses and overall costs, by LLM and systematic review

Supplementary material: Sciurti et al. supplementary material (File, 660.4 KB)