
Optimal large language models to screen citations for systematic reviews

Published online by Cambridge University Press:  23 June 2025

Takehiko Oami*
Affiliation:
Department of Emergency and Critical Care Medicine, Chiba University Graduate School of Medicine, Chiba, Japan
Yohei Okada
Affiliation:
Department of Preventive Services, Kyoto University Graduate School of Medicine, Kyoto, Japan; Pre-hospital and Emergency Research Centre, Health Services and Systems Research, Duke-NUS Medical School, National University of Singapore, Singapore
Taka-aki Nakada
Affiliation:
Department of Emergency and Critical Care Medicine, Chiba University Graduate School of Medicine, Chiba, Japan
*
Corresponding author: Takehiko Oami; Email: seveneleven711thanks39@msn.com

Abstract

Recent studies highlight the potential of large language models (LLMs) in citation screening for systematic reviews; however, the relative efficiency of individual LLMs for this application remains unclear. This study aimed to compare accuracy, time-related efficiency, cost, and consistency across four LLMs—GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B—for literature screening tasks. The models screened citations for clinical questions from the Japanese Clinical Practice Guidelines for the Management of Sepsis and Septic Shock 2024. Sensitivity and specificity were calculated for each model against conventional citation screening results as the reference standard. We also recorded the time and cost of screening and assessed consistency to verify reproducibility. A post hoc analysis explored whether integrating outputs from multiple models could enhance screening accuracy. GPT-4o and Llama 3.3 70B achieved high specificity but lower sensitivity, while Gemini 1.5 Pro and Claude 3.5 Sonnet exhibited higher sensitivity at the cost of lower specificity. Citation screening times and costs varied, with GPT-4o being the fastest and Llama 3.3 70B the most cost-effective. Consistency was comparable among the models. An ensemble approach combining model outputs improved sensitivity but increased the number of false positives, requiring additional review effort. Each model demonstrated distinct strengths, effectively streamlining citation screening by saving time and reducing workload. However, reviewing false positives remains a challenge. Combining models may enhance sensitivity, indicating the potential of LLMs to optimize systematic review workflows.
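The sensitivity and specificity reported in the abstract are standard confusion-matrix statistics computed against the conventional (human) screening result as the reference. A minimal sketch of that calculation is below; this is not the authors' code, and the function name and boolean encoding (True = include) are illustrative assumptions.

```python
# Hedged sketch (not from the paper): sensitivity/specificity of an
# LLM's include/exclude decisions, using the conventional screening
# result as the reference standard. True = include, False = exclude.
def screening_accuracy(llm_decisions, reference_decisions):
    pairs = list(zip(llm_decisions, reference_decisions))
    tp = sum(l and r for l, r in pairs)            # included by both
    fn = sum((not l) and r for l, r in pairs)      # missed by the LLM
    tn = sum((not l) and (not r) for l, r in pairs)  # excluded by both
    fp = sum(l and (not r) for l, r in pairs)      # extra LLM inclusions
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

High specificity (GPT-4o, Llama 3.3 70B) means few false positives to re-review; high sensitivity (Gemini 1.5 Pro, Claude 3.5 Sonnet) means few relevant studies missed.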

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Table 1 List of the patient/population/problem, intervention, comparison, and study design of the selected CQs


Figure 1 Schematic overview of systematic review between large language model (LLM)-assisted citation screening and the conventional method. Flowchart of the systematic review process: identification, title and abstract screening, and full-text screening. This figure also illustrates the timing of accuracy statistics for LLM-assisted citation screening in the primary and secondary analyses. CQ: clinical question.


Figure 2 Comparison of four large language models in terms of the accuracy of citation screening: primary analysis. Our primary analysis used the results of the included publications for qualitative assessment, using the conventional method as the standard reference. The individual sensitivity and specificity for each clinical question (CQ) and the integrated sensitivity values across CQs 1–5 were compared among GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B using confidence intervals.


Figure 3 Comparison of four large language models in terms of the accuracy of citation screening: secondary analysis. Our secondary analysis used the results of the included publications for a full-text screening session, using the conventional method as the standard reference. The individual sensitivity and specificity for each clinical question (CQ) and the integrated sensitivity values across CQs 1–5 were compared among GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B using confidence intervals.


Figure 4 Comparison of citation screening time, cost for 100 studies, and consistency among the four large language models (LLMs). The times needed to process 100 studies, with 95% confidence intervals, using the GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B LLMs were 0.93 [0.92–0.98] min, 1.53 [1.49–1.74] min, 3.25 [3.10–3.79] min, and 1.20 [1.12–1.24] min, respectively (a). The overall citation screening costs for 100 studies (with 95% confidence intervals) for GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B were $0.41 [0.34–0.48], $0.32 [0.22–0.42], $0.39 [0.34–0.43], and $0 [0–0], respectively (b). Consistency rates were calculated from the number of agreements on inclusion or exclusion and the total number of citations across the three sessions of citation screening. The results for GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B ranged from 95.4% to 99.3%, 96.1% to 98.7%, 95.4% to 96.8%, and 96.5% to 99.1%, respectively (c).
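The consistency rate in panel (c) is described as the proportion of agreeing include/exclude decisions between screening sessions. A minimal pairwise sketch, assuming boolean decision lists aligned by citation (the function name and encoding are illustrative, not from the paper):

```python
# Hedged sketch: pairwise agreement rate between two screening
# sessions of the same model over the same citation list.
def consistency_rate(session_a, session_b):
    if len(session_a) != len(session_b):
        raise ValueError("sessions must cover the same citations")
    agreements = sum(a == b for a, b in zip(session_a, session_b))
    return agreements / len(session_a)
```

With three sessions, this would be evaluated for each of the three session pairs, giving the ranges reported in the caption.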


Figure 5 Post hoc analysis for our secondary analysis using an ensemble method. The results for the included publications were qualitatively analyzed using the conventional method as the standard reference. As a large language model ensemble method, publications included by either of the LLMs were counted as included publications. The individual sensitivity and specificity values for each clinical question (CQ), as well as the integrated sensitivity values across CQs 1–5, are presented using confidence intervals.
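The ensemble rule described in the caption is a simple union: a citation counts as included if any of the combined LLMs included it. A minimal sketch under that reading (names and data layout are illustrative assumptions):

```python
# Hedged sketch: union-rule ensemble over per-model include/exclude
# decisions. decisions_by_model is a list of boolean lists, one per
# LLM, aligned by citation; True = include.
def ensemble_include(decisions_by_model):
    return [any(votes) for votes in zip(*decisions_by_model)]
```

The union rule can only raise sensitivity (any model's true positive is kept) while specificity can only fall (any model's false positive is also kept), which matches the reported trade-off of higher sensitivity with more false positives to review.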

Supplementary material: File

Oami et al. supplementary material

File 85.7 KB