
Beyond human gold standards: A multimodel framework for automated abstract classification and information extraction

Published online by Cambridge University Press:  17 November 2025

Delphine S. Courvoisier
Affiliation:
Rheumatology, Geneva University Hospitals , Switzerland Rheumatology, University of Geneva , Switzerland
Diana Buitrago-Garcia
Affiliation:
Rheumatology, University of Geneva , Switzerland
Clément P. Buclin
Affiliation:
Internal Medicine, Geneva University Hospitals , Switzerland
Nils Bürgisser
Affiliation:
Internal Medicine, Geneva University Hospitals , Switzerland
Michele Iudici
Affiliation:
Rheumatology, Geneva University Hospitals , Switzerland Rheumatology, University of Geneva , Switzerland
Denis Mongin*
Affiliation:
Rheumatology, Geneva University Hospitals , Switzerland Rheumatology, University of Geneva , Switzerland
*
Corresponding author: Denis Mongin; Email: denis.mongin@unige.ch

Abstract

Meta-research and evidence synthesis require considerable resources. Large language models (LLMs) have emerged as promising tools to assist in these processes, yet their performance varies across models, limiting their reliability. Taking advantage of the wide availability of small (<10 billion parameter) open-source LLMs, we implemented an agreement-based framework in which a decision is accepted only if at least a given number of LLMs produce the same response; the decision is otherwise withheld. This approach was tested on 1020 abstracts of randomized controlled trials in rheumatology, using 2 classic literature-review tasks: (1) classifying each intervention as drug or nondrug based on text interpretation and (2) extracting the total number of randomized patients, a task that sometimes required calculations. Re-examining abstracts where at least 4 LLMs disagreed with the human gold standard (dual review with adjudication) allowed us to construct an improved gold standard. Compared to the human gold standard and to single large LLMs (>70 billion parameters), our framework demonstrated robust performance: several model combinations achieved accuracies above 95%, exceeding the human gold standard on at least 85% of abstracts (e.g., 3 of 5 models, 4 of 6 models, or 5 of 7 models). Performance variability across individual models was not an issue, as low-performing models contributed fewer accepted decisions. This agreement-based framework offers a scalable solution that can replace human reviewers for most abstracts, reserving human expertise for more complex cases. Such frameworks could significantly reduce the manual burden in systematic reviews while maintaining high accuracy and reproducibility.
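The agreement rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: a decision is returned only when at least `threshold` of the collected model responses coincide, and is withheld (here, `None`) otherwise. The example responses are invented for illustration.

```python
from collections import Counter

def agreement_decision(responses, threshold):
    """Accept the most common response only if at least `threshold`
    models produced it; otherwise withhold the decision (None)."""
    value, count = Counter(responses).most_common(1)[0]
    return value if count >= threshold else None

# Hypothetical outputs of 5 models extracting the number of randomized patients
print(agreement_decision([120, 120, 120, 118, 120], threshold=3))  # 120 (accepted)
print(agreement_decision([120, 118, 90, 240, 60], threshold=3))    # None (withheld)
```

Withheld cases are exactly those the framework would route to a human reviewer.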

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Figure 1 Structure of the prompt for the 2 classification tasks.


Table 1 Accuracy of individual models, a single human reviewer, and the human gold standard, for 2 tasks: the classification of the intervention (drug versus nondrug) and the extraction of the number of randomized patients


Figure 2 Considering all combinations of LLMs, for a given number of LLMs (x-axis) and a given agreement threshold (fill color), the figure presents the accuracy of the agreement decisions (top panels) and the corresponding percentage of abstracts with an agreement result (bottom panels) for both tasks: classification of the intervention as drug or nondrug (left panels) and extraction of the number of randomized patients (right panels). Boxplots too thin to display were replaced by a diamond-like shape.


Figure 3 Considering all combinations of 5 LLMs (one point per combination, 56 points in total) for both tasks: classification of the intervention (left panels) and extraction of the number of randomized patients (right panels). Top panels: accuracy of the agreement results (vertical axis) versus the mean accuracy of the individual LLMs; dashed lines indicate equal agreement and individual accuracy. Bottom panels: percentage of abstracts with an agreement result (vertical axis) versus the mean accuracy of the individual LLMs. Mean accuracy is the average of each LLM's accuracy for a given task. As 8 LLMs were considered, studying combinations of 5 LLMs means enumerating all possible subsets of 5 among the 8 available, each combination contributing one point to the figure (56 possible combinations in total).
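The count of 56 combinations follows directly from choosing 5 models out of the 8 available. A quick check (the model names below are placeholders, not the actual LLMs evaluated in the study):

```python
from itertools import combinations
from math import comb

# 8 placeholder names standing in for the 8 evaluated LLMs
models = [f"model_{i}" for i in range(1, 9)]

# Every way to pick 5 of the 8 models; each subset is one point in Figure 3
combos = list(combinations(models, 5))
print(len(combos), comb(8, 5))  # 56 56
```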

Supplementary material: Courvoisier et al. supplementary material (File, 702.5 KB)