Prompt engineering of large language models for paper screening in medical meta-analyses and systematic reviews: A prospective comparative study

Published online by Cambridge University Press: 14 April 2026

Till J. Adam*
Affiliation:
Department of Child and Adolescent Psychiatry, Psychosomatic Medicine and Psychotherapy, Charité – Universitätsmedizin Berlin, Germany
Salma A. S. Abosabie
Affiliation:
Department of Radiology, Charité – Universitätsmedizin Berlin, Germany
Max Dittmer
Affiliation:
Department of Physics, Technische Universität Berlin, Germany
Elise Wolf
Affiliation:
School of Business Informatics and Mathematics, Universität Mannheim, Germany
Sara A. Abosabie
Affiliation:
Department of Radiology, Charité – Universitätsmedizin Berlin, Germany
Clara Behnke
Affiliation:
Department of Psychology, Humboldt-Universität zu Berlin, Germany
Felix Baier
Affiliation:
Department of Psychology, Humboldt-Universität zu Berlin, Germany
Annabelle Weickmann
Affiliation:
Department of Psychology, Humboldt-Universität zu Berlin, Germany
Ludwig Köser
Affiliation:
Department of Psychology, University of Potsdam, Germany
Christoph U. Correll
Affiliation:
Department of Child and Adolescent Psychiatry, Psychosomatic Medicine and Psychotherapy, Charité – Universitätsmedizin Berlin, Germany
Niklas Rutsch
Affiliation:
Department of Neurosurgery, Charité – Universitätsmedizin Berlin, Germany
*Corresponding author: Till J. Adam; Email: till-julius.adam@charite.de

Abstract

Interest in large language models (LLMs) as a tool for meta-analyses and systematic reviews (MA/SRs) is growing. We prospectively developed 515 unique prompts according to predefined screening-related categories and tested them with open-access LLMs (Llama, Mistral) against four gold-standard MA/SRs from different medical fields published after the LLMs’ training cut-offs, using a Python-based pipeline. Heterogeneity between prompts was quantified, and the hypothetical workload/cost reduction with top-performing prompts was calculated. Across 12,360 pipeline runs, LLMs versus MA/SRs reached average recall/sensitivity = 83.6 ± 17.0%, precision = 18.5 ± 15.6%, specificity = 36.6 ± 23.7%, F1-score = 27.6 ± 17.2%, and accuracy = 61.1 ± 11.0%. F1-scores were significantly higher when prompts focused on methods (0.78 ± 0.40%), explicitly mentioned MA/SR screening (0.81 ± 0.37%), included the comparison MA/SR’s title (5.64 ± 0.37%) or selection criteria (8.05 ± 0.68%), and when the LLM had more parameters (70b = 4.48 ± 0.31%, 123b = 7.77 ± 0.31%), but lower when screening abstracts instead of titles (−3.67 ± 0.28%). In LLM-based preselection, top-performing F1-score prompts (recall/sensitivity = 72.2%, specificity = 66.1%, precision = 28.6%) would reduce screening demands by 34.5%–37.5%, saving 8.4–8.8 weeks of work and 17,592–18,552. Recall/sensitivity increased with less MA/SR information, in contrast to the F1-score results, which highlights a trade-off between recall/sensitivity and precision/specificity. F1-score increased with detailed MA/SR information, while recall/sensitivity increased with shorter, zero-shot prompts. We provide the first prospectively assessed prompt engineering framework for early-stage LLM-based paper screening across medical fields. The publicly available Python pipeline and full prompt list used here support further development of LLM-based evidence synthesis.
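
The selection performance metrics reported in the abstract follow the standard confusion-matrix definitions, where the reference MA/SR's inclusion decisions serve as the gold standard. The following is a minimal illustrative sketch, not the authors' published pipeline: the names `ScreeningCounts` and `screening_metrics`, the example counts, and the `excluded_share` quantity (the share of records the LLM would remove from manual screening) are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """Hypothetical container comparing LLM decisions with a reference MA/SR."""
    tp: int  # included by the LLM and by the reference review
    fp: int  # included by the LLM, excluded by the reference review
    tn: int  # excluded by the LLM and by the reference review
    fn: int  # excluded by the LLM, included by the reference review


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def screening_metrics(c: ScreeningCounts) -> dict[str, float]:
    """Compute standard screening metrics from confusion-matrix counts."""
    total = c.tp + c.fp + c.tn + c.fn
    recall = _ratio(c.tp, c.tp + c.fn)          # also called sensitivity
    precision = _ratio(c.tp, c.tp + c.fp)
    specificity = _ratio(c.tn, c.tn + c.fp)
    f1 = _ratio(2 * precision * recall, precision + recall)
    accuracy = _ratio(c.tp + c.tn, total)
    # Assumption: records the LLM excludes need no manual screening.
    excluded_share = _ratio(c.tn + c.fn, total)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
        "excluded_share": excluded_share,
    }


# Made-up counts for illustration only; they are not taken from the study.
example = ScreeningCounts(tp=80, fp=300, tn=600, fn=20)
metrics = screening_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With counts like these, recall is high while precision stays low because relevant papers are rare among screened records, which is the same kind of trade-off between recall/sensitivity and precision/specificity that the abstract describes.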

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NoDerivatives licence (https://creativecommons.org/licenses/by-nd/4.0), which permits re-use, distribution, and reproduction in any medium, provided that no alterations are made and the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology
Table 1 Overall mean and standard deviation of selection performance metrics by category, minimum and maximum by prompt average

Figure 1 Selection performance across metrics and prompt characteristic categories. Abbreviations: MA/SR, meta-analysis or systematic review; I/E, inclusion/exclusion.

Table 2 Effects of prompt and screening characteristics on paper selection performance

Table 3 Analysis of variance comparing individual prompts across comparison meta-analyses/systematic reviews, large language models, titles and abstracts, with top-performing prompts shown

Table 4 Analysis of variance comparing selection performance metrics across comparison meta-analyses/systematic reviews

Supplementary material: File

Adam et al. supplementary material

File 762.6 KB