
Validation of large language models (Llama 3 and ChatGPT-4o mini) for title and abstract screening in biomedical systematic reviews

Published online by Cambridge University Press:  24 March 2025

Adriana López-Pineda
Affiliation:
Network for Research on Chronicity, Primary Care and Health Promotion (RICAPPS), San Juan de Alicante, Spain; Clinical Medicine Department, School of Medicine, Miguel Hernandez University, San Juan de Alicante, Spain; Primary Care Research Center, Miguel Hernandez University, San Juan de Alicante, Spain
Rauf Nouni-García*
Affiliation:
Network for Research on Chronicity, Primary Care and Health Promotion (RICAPPS), San Juan de Alicante, Spain; Clinical Medicine Department, School of Medicine, Miguel Hernandez University, San Juan de Alicante, Spain; Primary Care Research Center, Miguel Hernandez University, San Juan de Alicante, Spain
Álvaro Carbonell-Soliva
Affiliation:
Network for Research on Chronicity, Primary Care and Health Promotion (RICAPPS), San Juan de Alicante, Spain; Clinical Medicine Department, School of Medicine, Miguel Hernandez University, San Juan de Alicante, Spain; Primary Care Research Center, Miguel Hernandez University, San Juan de Alicante, Spain
Vicente F Gil-Guillén
Affiliation:
Network for Research on Chronicity, Primary Care and Health Promotion (RICAPPS), San Juan de Alicante, Spain; Clinical Medicine Department, School of Medicine, Miguel Hernandez University, San Juan de Alicante, Spain; Primary Care Research Center, Miguel Hernandez University, San Juan de Alicante, Spain
Concepción Carratalá-Munuera
Affiliation:
Network for Research on Chronicity, Primary Care and Health Promotion (RICAPPS), San Juan de Alicante, Spain; Clinical Medicine Department, School of Medicine, Miguel Hernandez University, San Juan de Alicante, Spain; Primary Care Research Center, Miguel Hernandez University, San Juan de Alicante, Spain
Fernando Borrás
Affiliation:
Department of Statistics, Mathematics and Informatics, Miguel Hernandez University, San Juan de Alicante, Spain
*
Corresponding author: Rauf Nouni-García; Email: rnouni@umh.es

Abstract

With the increasing volume of scientific literature, there is a need to streamline the screening process for titles and abstracts in systematic reviews, reduce the workload for reviewers, and minimize errors. This study validated artificial intelligence (AI) tools, specifically Llama 3 70B via Groq’s application programming interface (API) and ChatGPT-4o mini via OpenAI’s API, for automating this process in biomedical research. It compared these AI tools with human reviewers using 1,081 articles after duplicate removal. Each AI model was tested in three configurations to assess sensitivity, specificity, predictive values, and likelihood ratios. The Llama 3 model’s LLA_2 configuration achieved 77.5% sensitivity and 91.4% specificity, with 90.2% accuracy, a positive predictive value (PPV) of 44.3%, and a negative predictive value (NPV) of 97.9%. The ChatGPT-4o mini model’s CHAT_2 configuration showed 56.2% sensitivity, 95.1% specificity, 92.0% accuracy, a PPV of 50.6%, and an NPV of 96.1%. Both models demonstrated strong specificity, with CHAT_2 having higher overall accuracy. Despite these promising results, manual validation remains necessary to address false positives and negatives, ensuring that no important studies are overlooked. This study suggests that AI can significantly enhance efficiency and accuracy in systematic reviews, potentially revolutionizing not only biomedical research but also other fields requiring extensive literature reviews.
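The validity indicators reported in the abstract (sensitivity, specificity, predictive values, accuracy, and likelihood ratios) all derive from a 2×2 confusion matrix comparing each AI configuration's include/exclude decisions against the human reviewers' reference standard. A minimal sketch of the calculations, using illustrative counts that are hypothetical and not taken from this study:

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Validity indicators for a screening tool versus the human
    reference standard ('included' is the positive class)."""
    sensitivity = tp / (tp + fn)              # fraction of truly relevant articles kept
    specificity = tn / (tn + fp)              # fraction of truly irrelevant articles excluded
    ppv = tp / (tp + fp)                      # positive predictive value
    npv = tn / (tn + fn)                      # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    lr_pos = sensitivity / (1 - specificity)  # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity  # negative likelihood ratio
    return {
        "sensitivity": sensitivity, "specificity": specificity,
        "ppv": ppv, "npv": npv, "accuracy": accuracy,
        "lr+": lr_pos, "lr-": lr_neg,
    }

# Hypothetical counts for illustration only: 100 truly relevant
# and 900 truly irrelevant articles in a screened set of 1,000.
print(diagnostic_metrics(tp=80, fp=20, tn=880, fn=20))
```

Because relevant articles are rare at the title/abstract stage, a high NPV is the key safety indicator: it quantifies how often the tool's "exclude" decisions can be trusted without losing studies that belong in the review.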

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open data
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology
Table 1 Configuration of the AI tools

Figure 1 Flowchart of the screening process through manual review and artificial intelligence tools.

Table 2 Validity indicators for each AI configuration

Table 3 Validity indicators for each AI configuration (continuation)

Supplementary material: File

López-Pineda et al. supplementary material (File, 43.2 KB)