Highlights
What is already known?
• Use of LLMs for research synthesis is promising for abstract classification and data extraction.
• However, performance varies considerably between models.
• A key limitation is the tendency of LLMs to hallucinate or to fail to recognize when they do not know the answer, which can hinder achieving the sensitivity required for research synthesis.
What is new?
• An agreement-based approach using multiple open-source small LLMs (<10B parameters) achieved >95% accuracy in classifying drug versus nondrug interventions and extracting sample sizes from article abstracts.
• The method demonstrated robust performance even when individual LLMs varied in accuracy, as the agreement strategy effectively filtered out errors.
• The study showed that a fixed decision threshold can balance accuracy and coverage, ensuring high precision while flagging uncertain cases for human review.
Potential impact for RSM readers
• Our proposed framework ensures reliable use of AI screening for classification or data extraction tasks, replacing human data extraction for most abstracts.
• Our study encourages the clinical research community to consider open-source, locally deployable AI tools as viable alternatives to proprietary models, addressing ethical and access concerns.
1 Introduction
Meta-research and evidence synthesis research compile and analyze evidence from primary studies relevant to clinical fields, public health, social sciences, and other areas of knowledge.1, Reference Ioannidis2 Due to the exponential increase in the number of scientific publications,Reference Bornmann, Haunschild and Mutz3– Reference Bornmann and Mutz5 these research projects require considerable resources, especially in time and human effort.Reference Borah, Brown, Capers and Kaiser6 One of the main tasks is the selection of studies based on their abstract and title. Machine learning has recently gained attention as a potential tool to reduce workload and enhance efficiency when performing such work. Classifiers such as random forests or neural networks were first considered to help with the title and abstract screening process in a semiautomated way.Reference Affengruber, van der Maten and Spiero7 Currently, the transformer architecture and the subsequent era of large language models (LLMs) are being investigated to help reduce human workload in the screening phase. A first approach was to train small LLMs on specific classification tasks,Reference Aum and Choe8 which produced promising results but had the disadvantage of being very specific to the task and the training dataset. The rapid improvement of LLMs has since produced highly versatile models able to perform various tasks, including text completion, translation, summarization, and answering questions.Reference Minaee, Mikolov and Nikzad9 LLMs have shown abilities close to or above human level in various areas such as passing exams,Reference Katz, Bommarito, Gao and Arredondo10 assessing trends and topics presented in academic meetings,Reference Fijačko, Creber and Abella11 developing prediction models in neuroscience,Reference Luo, Rechardt and Sun12 diagnostic reasoning,Reference Goh, Gallo and Hom13 or critically inspecting research papers.Reference Liang, Zhang and Cao14
In evidence synthesis, LLMs have shown promising results for abstract classification and screening tasks,Reference Oami, Okada and Nakada15– Reference Li, Sun and Tan17 or data extraction from published studies,Reference Gartlehner, Kahwati and Hilscher18 yet without reaching the performance thresholds needed for fully automated tasks so far.Reference Lieberum, Töws and Metzendorf19 Furthermore, the use of LLMs for such tasks suffers from several limitations. First, their performance varies with the specific model used and the task tested, raising concerns about the generalizability and reproducibility of the results.Reference Chae and Davidson20– Reference Nguyen-Trung, Saeri and Kaufman23 Second, regarding reproducibility, information on retraining of the models is often lacking, which may lead to unknown changes in performance. Third, larger models tend to give wrong answers more confidently,Reference Zhou, Schellaert, Martínez-Plumed, Moros-Daval, Ferri and Hernández-Orallo24 raising concerns about potential harm for evidence synthesis. Finally, most studies so far rely on proprietary cloud-based LLMs, leading to ethical questions about data ownership and access to proprietary or sensitive information when using full texts.Reference Ong, Chang and William25
Taking advantage of the wide availability of medium-sized (between 10 and 40 billion parameters) and small-sized (<10 billion parameters) open-source LLMs,Reference Minaee, Mikolov and Nikzad9 we implemented an agreement-based approach in which a decision is taken only if at least a given number of LLMs (the decision threshold) produce the same response; otherwise, the decision is withheld. This approach was tested on abstracts of randomized controlled trials in rheumatology published from 2009 to 2022,Reference Mongin, Buitrago-Garcia and Capderou26 using 8 different LLMs. The LLMs performed 2 classic literature review tasks: (1) classifying each intervention as drug or nondrug based on text interpretation and (2) extracting the total number of randomized patients. Results were compared to state-of-the-art LLMs as well as the conventional human gold standard.27 A “platinum standard” was obtained by humans thoroughly rechecking all abstracts for which at least 4 LLMs agreed on an answer different from the human gold standard. LLM decision results and the human gold standard were then compared to this new platinum standard.
2 Methods
2.1 Abstracts analyzed
The included abstracts are primary reports of RCTs in rheumatology published between 2009 and 2022, used in a previous study.Reference Mongin, Buitrago-Garcia and Capderou26 They were used for 2 tasks:
- Binary classification, in which each abstract was classified according to whether the intervention tested was a drug or not (nondrug).
- Extraction of the number of randomized patients in the RCT, or reporting the value as “missing” if the information was not present.
For the second task, human reviewers also recorded whether the number was given as a total or by arm, the latter requiring a calculation.
2.2 List of LLMs considered
In all, 8 openly available LLMs with fewer than 10 billion parameters were considered:
- Llama 3 8B from Meta AI28
- Ministral 8B from Mistral29
- Qwen 2.5 7B from Alibaba Qwen30
- Yi 1.5 9B from 01.ai31
- Gemma 2 9B from Google32
- Deepseek 7B from Deepseek AI33
- Phi 3 small (7B) from Microsoft34
- Aya Expanse 8B from Cohere For AI35
We used the instruct versions of these models, which were downloaded from Hugging Face and run in Python through the transformers library.
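For illustration, loading and querying one of these instruct models locally with the transformers library follows the pattern sketched below. This is a minimal sketch, assuming a Hugging Face repository identifier, precision, and generation length that may differ from the exact configuration used in the study (available in the project repository).

```python
# Minimal sketch of running one of the small instruct models locally with transformers.
# The repository id, dtype, and max_new_tokens are illustrative assumptions, not the
# exact settings used in the study.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example Hugging Face repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps memory use in the 15-20 GB VRAM range
    device_map="auto",           # place the weights on the available GPU(s)
)

messages = [
    {"role": "system", "content": "You extract information from abstracts of randomized controlled trials."},
    {"role": "user", "content": "Is the intervention in the following abstract a drug or a nondrug intervention? ..."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```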
Two state-of-the-art, larger models were also considered for comparison (Table 1).
2.3 The gold standard
Both tasks were performed independently by 2 human reviewers (health care professionals: statisticians and epidemiologists). In case of disagreement, adjudication was performed by a third reviewer, yielding the human gold standard.
2.4 The platinum standard
When at least 4 LLMs provided a concordant answer that differed from the human gold standard, the human gold standard was revised by a fourth reviewer, who had access to the LLM justifications. The resulting reference, named the platinum standard, was used as the reference standard in all subsequent analyses.
2.5 LLM inference
Let ${N}_{LLM}$ be the number of LLMs included in the task and ${N}_\textit{thres}$ the agreement threshold. Our framework considered an answer valid only if it was provided by at least ${N}_\textit{thres}$ LLMs. If this number of agreeing LLMs was not reached, the task was considered nonanswered for the given abstract (Figure 1). For the classification task, agreement was reached if ${N}_\textit{thres}$ LLMs classified the abstract in the same category (drug or nondrug). For instance, with ${N}_\textit{thres}$ set to 5 and 7 LLMs considered, if 5 or more LLMs classify the abstract as nondrug, agreement is reached; if only 4 LLMs classify the abstract as nondrug, agreement is not reached and the task is considered nonanswered. For the sample size task, agreement was reached if ${N}_\textit{thres}$ LLMs reported the exact same number.
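This decision rule can be written in a few lines; the sketch below is illustrative (function and variable names are ours, not taken from the study code) and returns the agreed answer when at least ${N}_\textit{thres}$ identical answers are collected, and no decision otherwise.

```python
# Illustrative sketch of the agreement rule: an answer is accepted only if at least
# n_thres of the collected LLM answers are identical; otherwise the abstract is
# considered nonanswered (None).
from collections import Counter
from typing import Optional

def agreement_decision(answers: list[str], n_thres: int) -> Optional[str]:
    """answers holds one response per LLM for a single abstract, e.g. "drug"/"nondrug"
    for the classification task or the extracted sample size for the extraction task."""
    if not answers:
        return None
    most_common_answer, count = Counter(answers).most_common(1)[0]
    return most_common_answer if count >= n_thres else None

# Worked example from the text: 7 LLMs considered and a threshold of 5.
print(agreement_decision(["nondrug"] * 5 + ["drug"] * 2, n_thres=5))  # -> nondrug (agreement reached)
print(agreement_decision(["nondrug"] * 4 + ["drug"] * 3, n_thres=5))  # -> None (nonanswered)
```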

Figure 1 Structure of the prompt for the 2 classification tasks.
${N}_{LLM}$ was varied between 3 and 7, and all combinations among the 8 LLMs were tested, leading to 56, 70, 56, 28, and 8 LLM combinations, respectively. ${N}_\textit{thres}$ was varied between half of ${N}_{LLM}$, rounded down to the nearest integer, and ${N}_{LLM}$.
For example, considering the sample size task and fixing ${N}_{LLM}$ at 6 and ${N}_\textit{thres}$ at 4, each abstract would be evaluated by a given selection of 6 LLMs among the 8, and the procedure would provide a result only if at least 4 of them reported the exact same sample size. This procedure would then be repeated for each possible draw of 6 LLMs among the 8.
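The factorial design over model subsets can be sketched as follows (a compact version of the agreement_decision function above is repeated for completeness; the model labels and the answers dictionary are illustrative, not study data).

```python
# Illustrative sketch of the factorial design: enumerate all subsets of a given size
# among the 8 LLMs and apply the agreement rule within each subset.
from collections import Counter
from itertools import combinations

llms = ["llama3-8b", "ministral-8b", "qwen2.5-7b", "yi1.5-9b",
        "gemma2-9b", "deepseek-7b", "phi3-small", "aya-expanse-8b"]

# Number of distinct subsets for each N_LLM between 3 and 7
for n_llm in range(3, 8):
    print(n_llm, len(list(combinations(llms, n_llm))))  # 3 -> 56, 4 -> 70, 5 -> 56, 6 -> 28, 7 -> 8

def agreement_decision(answers, n_thres):
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count >= n_thres else None

# Worked example: N_LLM = 6 and N_thres = 4 for the sample size task, with illustrative
# answers for a single abstract (one extracted number per model).
answers = {"llama3-8b": "120", "ministral-8b": "120", "qwen2.5-7b": "120",
           "yi1.5-9b": "118", "gemma2-9b": "120", "deepseek-7b": "120",
           "phi3-small": "120", "aya-expanse-8b": "missing"}

for subset in combinations(llms, 6):  # each possible draw of 6 LLMs among the 8
    decision = agreement_decision([answers[m] for m in subset], n_thres=4)
    # decision is the agreed sample size, or None when fewer than 4 of the 6 models agree
```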
2.6 Prompt and parameters
All LLMs were given the same prompt and parameters. We used random sampling combining nucleus (cumulative probability, top_p) and most-likely-next-token (top_k) sampling, with top_k = 40, top_p = 0.95, and a temperature of 0.4. A sensitivity analysis varied the top_p, top_k, and temperature parameters using the Qwen 2.5 7B model: we performed one inference without sampling (greedy search) and, with sampling, varied top_p from 0.95 to 0.80, top_k from 40 to 20, and the temperature among 0.2, 0.4, and 0.8.
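In the transformers library, these settings correspond to a generation configuration along the following lines (a sketch; max_new_tokens is an assumption, not a parameter reported in the study).

```python
# Sketch of the sampling settings described above, expressed as transformers
# GenerationConfig objects; max_new_tokens is illustrative.
from transformers import GenerationConfig

main_config = GenerationConfig(
    do_sample=True,      # random sampling
    top_k=40,            # keep only the 40 most likely next tokens
    top_p=0.95,          # nucleus (cumulative-probability) sampling
    temperature=0.4,
    max_new_tokens=256,
)

greedy_config = GenerationConfig(do_sample=False, max_new_tokens=256)  # greedy search (sensitivity analysis)
```

Either configuration would then be passed to model.generate(..., generation_config=...) when querying each model.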
The prompt was structured as follows (Figure 1; a minimal sketch is given after the list):
- Definition of the LLM role and objective
- Definition of the task and context (what is a drug and a nondrug intervention, what is the number of randomized patients)
- Definition of the output structure in JSON, with a first entry for an explanation, following the idea of a chain-of-thought structureReference Brown, Mann and Ryder38
- Few-shot prompting (5 examples)
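A skeleton of this structure could look as follows. The wording is paraphrased and abridged (an assumption on our part); the exact prompts used in the study are available in the repository referenced below.

```python
# Skeleton of the prompt structure described above. The wording is illustrative;
# the exact prompts are available in the project GitLab repository.

ROLE = ("You are an assistant helping with evidence synthesis of randomized "
        "controlled trials in rheumatology.")

TASK = ("Read the abstract and decide whether the tested intervention is a drug or a "
        "nondrug intervention. A drug intervention tests a pharmacological compound; "
        "interventions such as exercise, education, or surgery are nondrug interventions.")

OUTPUT_FORMAT = ('Answer only with a JSON object of the form '
                 '{"explanation": "<short reasoning>", "answer": "<drug or nondrug>"}, '
                 'giving the explanation before the answer.')

FEW_SHOT_EXAMPLES = "..."  # 5 worked examples (abstract followed by the expected JSON answer)

def build_prompt(abstract: str) -> str:
    """Assemble the 4 blocks of the prompt for a single abstract."""
    return "\n\n".join([ROLE, TASK, OUTPUT_FORMAT,
                        "Examples:\n" + FEW_SHOT_EXAMPLES,
                        "Abstract:\n" + abstract])
```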
Prompts, Python code, inference results, computer configuration, and the R code for the analysis are available at https://gitlab.unige.ch/trial_integrity/llm_majority_public.
2.7 Statistical analysis
The main performance metric for both tasks is accuracy, defined as the number of answers matching the platinum standard divided by the total number of queries. The accuracy of the LLM combinations was compared with that of the human gold standard using a t-test.
For the classification task, the recall (also called sensitivity, the proportion of abstracts with a given label that were correctly identified) and the F1 score, defined as the harmonic mean of precision (also called positive predictive value) and recall, are also provided for both categories (drug and nondrug).
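With TP, FP, and FN denoting, for a given category, the numbers of true positives, false positives, and false negatives relative to the platinum standard, these metrics can be written as:

$$\text{accuracy}=\frac{\text{number of answers matching the platinum standard}}{\text{total number of queries}},\qquad \text{recall}=\frac{TP}{TP+FN},$$
$$\text{precision}=\frac{TP}{TP+FP},\qquad F_1=\frac{2\times \text{precision}\times \text{recall}}{\text{precision}+\text{recall}}.$$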
All analyses and figures were produced with R 4.4.2.39
3 Results
The 1020 included abstracts were published in more than 20 journals, the most frequent being BMC Musculoskeletal Disorders (17.9%), followed by Annals of the Rheumatic Diseases (14.3%). General medical journals also contributed (e.g., Lancet: 5.3%; JAMA: 2.9%). The conditions studied were diverse, the most frequent being osteoarthritis (20.9%) and rheumatoid arthritis (20.2%).
LLMs were able to provide a readable JSON output for almost all abstracts, although one model (Deepseek) encountered more issues, especially for the drug/nondrug classification (Table 1, column "format issue"). Thus, the LLM decision could sometimes be missing because there were not enough model responses to reach the decision threshold. Both tasks required approximately 2 to 5 seconds per abstract and between 16 and 23 GB of video random access memory (VRAM) for all small LLMs, while the 2 larger models took around 145 GB of VRAM and 9 seconds per abstract (Table 1).
Table 1 Accuracy of individual models, single human reviewers, and the human gold standard, for 2 tasks: the classification of the intervention (drug versus nondrug) and the extraction of the number of randomized patients


Figure 2 Considering all combinations of LLMs, for a given number of LLMs (x-axis) and a given agreement threshold (filling color), the figure presents the resulting accuracy of the agreement (top panels) and the corresponding percentage of data with an agreement result (bottom panels) for both tasks: classification of the intervention between drug and nondrug (left panels), or extraction of the number of randomized patients (right panels). When the boxplots were too thin, they were replaced by a diamond-like shape.
3.1 Improving the gold standard
The human gold standard (2 human reviewers and a human adjudicator) classified 447 (43.8%) abstracts as containing a drug as an intervention. In 88 cases, at least 4 LLMs classified the abstracts differently from the humans, and 38 (43.2%) were subsequently reclassified in favor of the LLM agreement decision. All but one were reclassified from a drug to a nondrug intervention RCT, leading to the final platinum standard classifying 411 abstracts as containing a drug intervention: 447 abstracts initially classified as drug intervention, minus 37 reclassified as nondrug intervention, plus one reclassified as drug intervention. This corresponds to an accuracy of 96.3% (982 correctly classified and 38 misclassified out of 1020) for the human gold standard. Similarly, for the extraction of the number of randomized patients, the agreement of at least 4 LLMs yielded results different from the human gold standard for 45 abstracts, of which 18 (40%) were subsequently changed in favor of the LLM agreement. Thus, the human gold standard had an accuracy of 98.3%. The cases where the agreement of at least 4 LLMs differed from the human gold standard were, in the vast majority, situations for which the 2 human reviewers had disagreed and adjudication had been needed.
3.2 Performance
Overall, the human gold standard and most of the LLMs had better accuracy for sample size extraction than for drug classification (Table 1). Qwen 2.5 7B, Phi 3 small, Ministral, Llama 3 8B, and Gemma 2 9B had an accuracy similar to that of single human reviewers for both tasks. Aya 8B, Deepseek 7B, and Granite 3.0 8B tended to be less accurate than human reviewers, while large models (70 billion parameters) were better than a single human reviewer for both tasks but still slightly less accurate than the human gold standard. Both types of single raters had a lower accuracy than the human gold standard, with most ranging close to 90% for the drug/nondrug classification task, and close to 94% for the sample size task (Table 1).
Figure 2 presents the accuracy (top panels) and the percentage of abstracts with an agreement result (bottom panels) of the agreement of LLMs, for a given number of LLMs (x-axis) and a given agreement threshold (filling color). Combining the LLMs increased the accuracy of the agreement result, with the accuracy steadily increasing from below the human gold standard to nearly 100% when increasing the decision threshold. This increase in accuracy came at the cost of a decreased percentage of abstracts obtaining an agreement: the number of abstracts with enough agreeing LLMs to reach a decision decreased when the decision threshold was increased. For instance, when selecting combinations of 3 LLMs out of the 8 available, 2 agreement thresholds can be applied: either at least 2 models must agree, or all 3 must agree. The resulting accuracy distributions for these thresholds are shown in the first and second boxplots of the upper left panel in Figure 2, with median accuracies of 92.9% and 98.2%, respectively (see Supplementary Table S1 for the median and interquartile range of all accuracies). The corresponding proportions of abstracts for which agreement was reached under these thresholds are presented in the first and second boxplots of the lower left panel, with median agreement rates of 99.8% and 76.1%, respectively (Supplementary Table S2). Other performance metrics (precision, recall, and F1) showed similar patterns for all ratings (see Supplementary Tables S1–S14).
Considering jointly the number of abstracts receiving an LLM agreement rating and the accuracy, several combinations of the models reached a performance equal to or above the human gold standard accuracy, while still evaluating at least 85% of the abstracts: agreement of at least 3 models out of 4, agreement of at least 4 models out of 5, agreement of at least 4 models out of 6, or agreement of at least 5 models out of 7 (Figure 2). In more detail, an agreement of at least 4 models out of 5 was achieved for 86.6% (IQR: [84.5; 88.8]) and 89.9% (IQR: [86.2; 92.2]) of the abstracts for the intervention classification task and the sample size extraction task, respectively (Figure 2, bottom panels, light blue boxplot, for x-axis = 5, and Supplementary Tables S2 and S10, line for N model = 5 and Threshold = 4). This yielded the correct result (that is, the platinum standard) a median of 97.8% (IQR: [97.5; 98.1]) and 99.2% (IQR: [99.1; 99.4]) of the time (Figure 2, top panels, light blue boxplot, for x-axis = 5, and Supplementary Tables S1 and S9, line for N model = 5 and Threshold = 4). To achieve an optimal balance between agreement and accuracy, a more parsimonious approach would be to use the agreement of at least 3 out of 4 LLMs, a situation where an agreement is reached for 91.8% (IQR: [89.8; 93.6]) and 93.5% (IQR: [90.8; 95.5]) of the abstracts (Figure 2, bottom panels, light blue boxplot, for x-axis = 4, and Supplementary Tables S2 and S10, line for N model = 4 and Threshold = 3), yielding a median accuracy of 96.2% (IQR: [95.7; 96.7]) and 98.7% (IQR: [98.6; 99.0]), respectively (Figure 2, top panels, light blue boxplot, for x-axis = 4, and Supplementary Tables S1 and S9, line for N model = 4 and Threshold = 3).
The choice of the individual LLMs had almost no influence on the decision accuracy (Supplementary Figures 1 and 2), primarily because combinations involving lower-performing models were less likely to reach agreement. This pattern is shown in Figure 3, which examines combinations of 5 LLMs in detail. The proportion of abstracts for which agreement was reached declined almost linearly with decreasing individual accuracy of the selected LLMs, with steeper declines observed at higher decision thresholds. For instance, for the intervention classification task, the percentage of abstracts reaching agreement with 4 models (Figure 3, light blue dots) went from 90% to 80% when comparing combinations of the most accurate individual LLMs (right side of the graph) with the least accurate ones (left side of the graph), whereas the agreement of 5 LLMs changed from 70% to almost 40% (Figure 3, left panels).

Figure 3 Considering all combinations of 5 LLMs (one point per combination, 56 points in total) for both tasks: classification of the intervention (left panels) or extraction of the number of randomized patients (right panels). Top panels: comparison between the accuracy of the agreement results (vertical axis) and the mean accuracy of the individual LLMs; dashed lines indicate where agreement accuracy equals individual accuracy. Bottom panels: comparison between the percentage of data with an agreement (vertical axis) and the mean accuracy of the individual LLMs. Mean accuracy is computed by averaging the accuracy of each LLM for a given task. As 8 LLMs were considered, when studying the use of 5 LLMs we considered all possible combinations of 5 LLMs among the 8 available, each combination contributing one point in the figure (56 possible LLM combinations in total).
Changes to the model parameters had little effect on individual LLM performance: the accuracy of Qwen 2.5 7B varied from 92.4% to 93.4% when changing parameters for the intervention classification task, and from 92.8% to 96.1% for the extraction of the number of randomized participants, the best accuracy being obtained with the highest temperature (data not shown).
3.3 Subgroup analysis
The performance of the LLMs in detecting the 22 abstracts (out of 1020) that did not report any sample size was generally good, with a precision >90% when at least 3 LLMs classified the sample size as “missing,” while the recall remained at 90% (Supplementary Tables S9–S11).
For the sample size extraction task, the performance of the individual human reviewers was 23 percentage points lower when calculating the sample size was required, for example in abstracts presenting treatment groups separately, necessitating the summation of both arms to determine the total. This loss of individual performance also impacted the human gold standard. Although the adjudication corrected most of the initial human mistakes, situations where both reviewers made the same mistake or where the human adjudicator was wrong still left a 3 percentage point difference between abstracts that did not require computation (99.9%) and abstracts that required computation (97%). In contrast, individual LLMs made fewer calculation errors when computing the sample size, with a difference in accuracy of between 3 and 6 percentage points. As a result, the agreement result had a difference in accuracy of less than 1 percentage point between situations that required a calculation and those that did not (Supplementary Tables S12 and S13).
4 Discussion
We analyzed abstracts from published RCTs in rheumatology to systematically study the accuracy of an agreement-based prediction combining several LLMs of fewer than 10 billion parameters. We focused on 2 key tasks of evidence synthesis: classifying the intervention as drug or nondrug and extracting the number of randomized patients. The performance of the LLM agreement was compared to a human gold standard. This agreement-based approach has 2 advantages. First, the concordant decision of at least 4 LLMs can be used to improve the human gold standard: human reviewers could re-examine the abstracts for which human and LLM decisions differed, an important advantage as it is well known that even the gold standard resulting from 2 reviewers and an adjudicator does not provide 100% accuracy.Reference Wang, Nayfeh, Tetzlaff, O’Blenis and Murad40, Reference Mathes, Klaßen and Pieper41 The second advantage is that requiring agreement between a fixed number of LLMs before considering an answer as valid yielded accuracies surpassing both the human gold standard and the state-of-the-art LLMs. The accuracy of the results increased when the threshold (that is, the number of LLMs needing to agree with each other) to reach a decision was increased, although this meant achieving an agreement on fewer abstracts. Consequently, this approach remained robust regardless of the individual accuracy of the LLMs used, as lower-performing models produced fewer agreements.
This simple yet effective method addresses the issue raised by Zhou et al.,Reference Zhou, Schellaert, Martínez-Plumed, Moros-Daval, Ferri and Hernández-Orallo24 who highlighted that LLMs, especially larger ones, are prone to confidently generating incorrect responses. Requiring a minimum number of LLMs to provide concordant answers before accepting a response as valid (and rejecting cases where disagreement occurs) enhanced reliability. Incorrect responses were naturally filtered out, as different LLMs trained on different datasets produced different incorrect answers, making disagreement a strong indicator of inaccuracy. At the same time, this simple principle made the approach robust to the persistent problem of performance variation across models,Reference Li, Sun and Tan17, Reference Dennstädt, Zink, Putora, Hastings and Cihoric21, Reference Tang, Xiao and Li42 across the prompt used,Reference Oami, Okada and Nakada15, Reference Dennstädt, Zink, Putora, Hastings and Cihoric21, Reference Syriani, David and Kumar22, Reference Khraisha, Put, Kappenberg, Warraitch and Hadfield43 or even across time for the same proprietary model.Reference Aronson, Machini and Shin44
Several approaches using multiple LLMs to improve accuracy have been suggested in the literature. One involves active voting by LLMs,Reference Li, Zhang, Yu, Fu and Ye45, Reference Oniani, Hilsman, Dong, Gao, Verma and Wang46 where the LLMs are given the other models’ answers as part of the prompt and must collectively determine the correct answer, potentially after debate.Reference Du, Li, Torralba, Tenenbaum and Mordatch47 Another approach simply takes the majority answer from multiple LLMs, which has been tested in several studiesReference Li, Sun and Tan17, Reference Yang, Li, Zhou, Xiao, Fang and Zhang48 and has shown improved performance. To our knowledge, only one study used a fixed decision threshold to accept or reject the answer proposed by various LLMs,Reference Khan, Ayub and Naqvi49 with performance improvements in line with our study. It was, however, limited by the use of only 2 proprietary LLMs (GPT-4 and Claude 3).
Although this agreement-based approach improves accuracy, it implies that 10%–15% of the abstracts will not obtain a decision, owing to the conservative choice of withholding a decision when not enough models agree. To address these unclassified abstracts, 2 primary strategies can be employed. The first involves relying only on the human gold standard for these abstracts; the time saved by using LLMs could be invested in assigning additional human reviewers to these more difficult abstracts. The second entails using a single, larger model in conjunction with the human reviewer, allowing for more nuanced decision-making while maintaining efficiency. Both methods offer a means of resolving uncertainty while balancing precision and effort.
Our proposed method leverages the availability of numerous open-source LLMs, each with fewer than 10 billion parameters, requiring less than 20 GB of VRAM for inference, and able to run on consumer-grade hardware. Many of these models already achieve performance levels comparable to human reviewers for the 2 tasks examined in this study, support large-context inputs, can be run locally, and are freely accessible.
This study has several strengths. First, our agreement-based approach uses openly available small LLMs, which only take a few seconds to analyze an abstract, corresponding to less than 7 hours to process all abstracts by 5 different models. Second, the factorial design considering all possible combinations of models among 8 LLMs provided a clear picture of the variability in performance due to the choice of models. Finally, the use of openly available data and code ensures adequate reproducibility.
The main limitation concerns the generalizability of the findings of this study. Although we considered 2 different tasks, both with strikingly similar and excellent accuracy, performance could be lower for other tasks. Similarly, performance could differ with a different sample of abstracts. Finally, the accuracy of decisions based on full-text articles remains to be assessed.
5 Conclusion
The agreement-based approach applied to 2 tasks necessary for evidence synthesis yielded excellent accuracy (>95%) on at least 85% of abstracts, while taking less than 7 hours when used with 5 models. Although these promising results should be confirmed on other tasks and datasets, they pave the way to greatly facilitating evidence synthesis, as they could allow human reviewers to be replaced by artificial intelligence for most abstracts.
Author contributions
Data curation, validation, writing – original draft, writing – review and editing: DSC; data curation, writing – original draft, writing – review and editing: DB-G, NB; data curation, writing – review and editing: CPB; funding acquisition, writing – review and editing: MI; formal analysis, methodology, software, visualization, funding acquisition, writing – original draft, writing – review and editing: DM.
Competing Interest Statement
MI received consulting fees from Boehringer Ingelheim, received payment or honoraria for lectures, presentations, speakers bureaus, manuscript writing or educational events from Boehringer Ingelheim and CSL Vifor, and has participated in the Advisory Board for Novartis. The authors declare that no competing interests exist.
Data availability statement
All code and data are openly available in the GitLab repository https://gitlab.unige.ch/trial_integrity/llm_majority_public and in the Zenodo repository https://doi.org/10.5281/zenodo.15829040.Reference Mongin50
Funding statement
This study has been funded by the Swiss National Science Foundation Grant number 212393 “Fostering transparency in rheumatology randomized clinical trials.”
Role of the funder/sponsor
The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Supplementary material
To view supplementary material for this article, please visit http://doi.org/10.1017/rsm.2025.10054.
