Exploring the potential of Claude 2 for risk of bias assessment: Using a large language model to assess randomized controlled trials with RoB 2

Published online by Cambridge University Press: 12 March 2025

Angelika Eisele-Metzger*
Affiliation:
Institute for Evidence in Medicine, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany; Cochrane Germany, Cochrane Germany Foundation, Freiburg, Germany
Judith-Lisa Lieberum
Affiliation:
Eye Center, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany
Markus Toews
Affiliation:
Institute for Evidence in Medicine, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany
Waldemar Siemens
Affiliation:
Institute for Evidence in Medicine, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany
Felix Heilmeyer
Affiliation:
Institute for Digitalization in Medicine, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany
Christian Haverkamp
Affiliation:
Institute for Digitalization in Medicine, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany
Daniel Boehringer
Affiliation:
Eye Center, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany
Joerg J. Meerpohl
Affiliation:
Institute for Evidence in Medicine, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany; Cochrane Germany, Cochrane Germany Foundation, Freiburg, Germany
*Corresponding author: Angelika Eisele-Metzger; Email: angelika.eisele-metzger@uniklinik-freiburg.de

Abstract

Systematic reviews are essential for evidence-based health care, but conducting them is time- and resource-consuming. To date, efforts have been made to accelerate and (semi-)automate various steps of systematic reviews through the use of artificial intelligence (AI), and the emergence of large language models (LLMs) promises further opportunities. One crucial but complex task within systematic review conduct is assessing the risk of bias (RoB) of included studies. Therefore, the aim of this study was to test the LLM Claude 2 for RoB assessment of 100 randomized controlled trials, published in English from 2013 onwards, using the revised Cochrane risk of bias tool (‘RoB 2’; involving judgements for five specific domains and an overall judgement). We assessed the agreement of RoB judgements by Claude with human judgements published in Cochrane reviews. The observed agreement between Claude and Cochrane authors ranged from 41% for the overall judgement to 71% for domain 4 (‘outcome measurement’). Cohen’s κ was lowest for domain 5 (‘selective reporting’; 0.10 (95% confidence interval (CI): −0.10 to 0.31)) and highest for domain 3 (‘missing data’; 0.31 (95% CI: 0.10 to 0.52)), indicating slight to fair agreement. Fair agreement was found for the overall judgement (Cohen’s κ: 0.22 (95% CI: 0.06 to 0.38)). Sensitivity analyses using alternative prompting techniques or the more recent version Claude 3 did not result in substantial changes. Currently, Claude’s RoB 2 judgements cannot replace human RoB assessment. However, the potential of LLMs to support RoB assessment should be further explored.
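
For readers less familiar with the two agreement measures reported above, the following is a minimal sketch of how observed agreement and unweighted Cohen's κ can be computed from paired judgements. It is not the authors' analysis code: the function names are hypothetical, the ten paired judgements are invented toy data (not taken from the study), and the use of unweighted κ is an assumption, since the abstract does not state whether a weighted variant was used. The verbal labels follow the commonly used Landis and Koch bands, which match the 'slight' and 'fair' wording of the abstract.

```python
from collections import Counter


def observed_agreement(a, b):
    """Share of items on which the two raters give the same judgement."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty rating lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Unweighted Cohen's kappa: agreement corrected for chance agreement."""
    n = len(a)
    p_o = observed_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: sum over categories of the product of the
    # two raters' marginal proportions.
    p_e = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return (p_o - p_e) / (1 - p_e)


def landis_koch_label(kappa):
    """Verbal label for kappa using the Landis and Koch bands."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Invented toy data (NOT from the study): ten paired RoB 2 judgements.
llm_judgements = ["low", "low", "some concerns", "high", "some concerns",
                  "low", "high", "some concerns", "low", "some concerns"]
reference_judgements = ["low", "some concerns", "some concerns", "high", "low",
                        "low", "some concerns", "some concerns", "high",
                        "some concerns"]

p_o = observed_agreement(llm_judgements, reference_judgements)
kappa = cohens_kappa(llm_judgements, reference_judgements)
print(f"observed agreement: {p_o:.2f}")
print(f"Cohen's kappa: {kappa:.3f} ({landis_koch_label(kappa)})")
```

Because κ subtracts the agreement expected by chance, a domain can show relatively high observed agreement alongside a low κ when both raters assign most trials to the same category.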

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Figure 1 PRISMA flow chart illustrating the search process for Cochrane reviews of interventions. *Cochrane reviews with no RCTs, only RCTs published before 2013, only cluster or cross-over RCTs, or a combination of these reasons.

Table 1 Risk of bias judgements of Claude 2 and Cochrane authors (number of judgements per RoB 2 domain, n = 100 RCTs)

Table 2 Overall risk of bias judgements of Claude 2 tabulated against the overall judgements of the Cochrane authors (n = 100 RCTs)

Figure 2 Sankey diagram illustrating differing and congruent overall risk of bias judgements of the Cochrane authors and Claude 2. An animated version of this figure can be accessed via https://osf.io/2phyt.

Table 3 Performance of Claude 2 compared to the Cochrane authors (n = 100 RCTs)

Table 4 Examples for two-level discrepancies between Claude and reference standard, with comments and suggested judgement by the authors of this article

Supplementary material: File

Eisele-Metzger et al. supplementary material
