
What level of automation is “good enough”? A benchmark of large language models for meta-analysis data extraction

Published online by Cambridge University Press: 26 January 2026

Lingbo Li*
Affiliation: School of Mathematical and Computational Sciences, Massey University, New Zealand
Anuradha Mathrani
Affiliation: School of Mathematical and Computational Sciences, Massey University, New Zealand
Teo Susnjak
Affiliation: School of Mathematical and Computational Sciences, Massey University, New Zealand
*Corresponding author: Lingbo Li; Email: l.li5@massey.ac.nz

Abstract

Automating data extraction from full-text randomized controlled trials (RCTs) for meta-analysis remains a significant challenge. This study evaluates the practical performance of three large language models (LLMs), Gemini-2.0-flash, Grok-3, and GPT-4o-mini, across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customized prompts) to determine how extraction quality can be improved. All models demonstrated high precision but consistently suffered from poor recall, omitting key information. We found that customized prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation according to task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.
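
To make the precision/recall framing concrete, the following is a minimal sketch of field-level scoring for extraction output, written in Python. The function name, data shapes, and exact-match rule are illustrative assumptions, not the evaluation code used in the study.

```python
# Minimal sketch: score one extracted record against its annotation.
# `extracted` and `truth` are assumed to be flat dicts mapping field
# names to values; a missing or None value counts as "not extracted".
def score_extraction(extracted: dict, truth: dict) -> tuple[float, float]:
    """Return (precision, recall) over the ground-truth fields."""
    correct = sum(1 for k, v in truth.items() if extracted.get(k) == v)
    predicted = sum(1 for v in extracted.values() if v is not None)
    precision = correct / predicted if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical example: the model omits one field (sd_sbp), so precision
# stays perfect while recall drops, the failure mode described above.
truth = {"n_participants": 120, "mean_sbp": 142.5, "sd_sbp": 9.8}
extracted = {"n_participants": 120, "mean_sbp": 142.5, "sd_sbp": None}
print(score_extraction(extracted, truth))  # -> (1.0, 0.666...)
```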

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Figure 1 Overview of the whole workflow. Full-text RCTs were collected from published meta-analyses and annotated to construct a ground-truth dataset, which served as the basis for evaluating and comparing multiple extraction methods.

Table 1 Characteristics of the included meta-analyses

Table 2 Overall performance for different extraction approaches across models

Figure 2 Average performance change (Δ precision and Δ recall) of three extraction strategies relative to the EXT baseline.

Figure 3 Friedman–Nemenyi critical difference (CD) graph based on mean rank in recall. If the horizontal line segments of two methods do not overlap, the difference in their performance is significant.
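
The caption above summarizes a standard ranking analysis; below is a minimal sketch of how such a test pair can be run in Python with SciPy and scikit-posthocs. The recall scores are hypothetical placeholders (with only three blocks the chi-square approximation is rough; the sketch shows structure only), and a real analysis would use one score per method per meta-analysis or task.

```python
# Minimal sketch of the Friedman + Nemenyi analysis behind a CD graph.
# Assumes scikit-posthocs is installed; the scores below are hypothetical.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Rows = datasets (blocks), columns = extraction methods (groups).
scores = np.array([
    [0.71, 0.78, 0.74, 0.83],  # hypothetical recall, dataset 1
    [0.65, 0.72, 0.70, 0.80],  # hypothetical recall, dataset 2
    [0.69, 0.75, 0.73, 0.81],  # hypothetical recall, dataset 3
])

# Friedman test: do the methods differ in mean rank across datasets?
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# Nemenyi post-hoc: pairwise p-values. In a CD graph, methods whose
# mean-rank difference is below the critical difference stay connected.
print(sp.posthoc_nemenyi_friedman(scores))
```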

Table 3 Comparison of model performance across different methods

Figure 4 Methods comparison across meta-analyses.

Figure 5 Models comparison across meta-analyses.

Table 4 Best-performing model–method combinations by category

Figure 6 Three-tier automation guideline for structured data extraction in meta-analysis, based on task difficulty, error risk, and need for human oversight. Percentages on the left indicate the estimated proportion of total human effort required in each tier to verify extracted data.

Table 5 Automation priorities, strategies, and roadmap tiers by information category

Supplementary material: File

Li et al. supplementary material (File, 155.1 KB)