
What level of automation is “good enough”? A benchmark of large language models for meta-analysis data extraction

Published online by Cambridge University Press:  26 January 2026

Lingbo Li*
Affiliation:
School of Mathematical and Computational Sciences, Massey University, New Zealand
Anuradha Mathrani
Affiliation:
School of Mathematical and Computational Sciences, Massey University, New Zealand
Teo Susnjak
Affiliation:
School of Mathematical and Computational Sciences, Massey University, New Zealand
*Corresponding author: Lingbo Li; Email: l.li5@massey.ac.nz

Abstract

Automating data extraction from full-text randomized controlled trials for meta-analysis remains a significant challenge. This study evaluates the practical performance of three large language models (LLMs) (Gemini-2.0-flash, Grok-3, and GPT-4o-mini) across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customized prompts) to determine how to improve extraction quality. All models demonstrated high precision but consistently suffered from poor recall, omitting key information. Customized prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.
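The contrast between high precision and poor recall can be illustrated with a minimal sketch. It assumes extraction output is scored as a set of (field, value) pairs matched exactly against annotated ground truth; this scoring scheme, the function name `precision_recall`, and the example fields are hypothetical and are not taken from the study.

```python
def precision_recall(extracted: set, truth: set) -> tuple[float, float]:
    """Score extracted (field, value) pairs against ground-truth pairs.

    Assumption: exact matching on (field, value); the study's actual
    matching criteria may differ.
    """
    true_positives = len(extracted & truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical annotated ground truth for one trial report.
truth = {
    ("sample_size", "120"),
    ("mean_age", "54.2"),
    ("sbp_change", "-8.1"),
    ("dropouts", "7"),
}

# Hypothetical model output: everything returned is correct,
# but two of the four fields are omitted.
extracted = {
    ("sample_size", "120"),
    ("mean_age", "54.2"),
}

precision, recall = precision_recall(extracted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every extracted item is correct, so precision is unaffected, while each omitted field lowers recall. This is the error pattern the abstract describes.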

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology
Figure 1 Overview of the whole workflow. Full-text RCTs were collected from published meta-analyses and annotated to construct a ground-truth dataset, which served as the basis for evaluating and comparing multiple extraction methods.

Table 1 Characteristics of the included meta-analyses

Table 2 Overall performance for different extraction approaches across models

Figure 2 Average performance change ($\Delta$ precision and $\Delta$ recall) of three extraction strategies relative to the EXT baseline.

Figure 3 Friedman–Nemenyi critical difference (CD) graph based on mean rank in recall. Two methods whose horizontal line segments do not overlap differ significantly in performance.

Table 3 Comparison of model performance across different methods

Figure 4 Methods comparison across meta-analyses.

Figure 5 Models comparison across meta-analyses.

Table 4 Best-performing model–method combinations by category

Figure 6 Three-tier automation guideline for structured data extraction in meta-analysis, based on task difficulty, error risk, and need for human oversight. Percentages on the left indicate the estimated proportion of total human effort required to verify extracted data in each tier.

Table 5 Automation priorities, strategies, and roadmap tiers by information category

Supplementary material: File

Li et al. supplementary material (File, 155.1 KB)