
Assessing risk of bias of cohort studies with large language models

Published online by Cambridge University Press:  07 August 2025

Danni Xia
Affiliation:
Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
Honghao Lai
Affiliation:
Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
Weilong Zhao
Affiliation:
Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
Jiajie Huang
Affiliation:
College of Nursing, Gansu University of Chinese Medicine, Lanzhou, China
Jiayi Liu
Affiliation:
Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
Ziying Ye
Affiliation:
Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
Jianing Liu
Affiliation:
College of Nursing, Gansu University of Chinese Medicine, Lanzhou, China
Mingyao Sun
Affiliation:
School of Nursing, Peking University, Beijing, China
Liangying Hou
Affiliation:
Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
Bei Pan
Affiliation:
Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
Long Ge*
Affiliation:
Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China Key Laboratory of Evidence Based Medicine of Gansu Province, Lanzhou, China
Corresponding author: Long Ge; Email: gelong2009@163.com

Abstract

This study aims to explore the feasibility and accuracy of utilizing large language models (LLMs) to assess the risk of bias (ROB) in cohort studies. We conducted a pilot and feasibility study in 30 cohort studies randomly selected from reference lists of published Cochrane reviews. We developed a structured prompt to guide ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 in assessing the ROB of each cohort study twice. We used the ROB results assessed by three evidence-based medicine experts as the gold standard, and then evaluated the accuracy of the LLMs by calculating the correct assessment rate, sensitivity, specificity, and F1 scores at the overall and item-specific levels. The consistency of the overall and item-specific assessment results was evaluated using Cohen’s kappa (κ) and the prevalence-adjusted bias-adjusted kappa (PABAK). Efficiency was estimated by the mean assessment time required. The three LLMs showed distinct performance across the eight assessment items. Overall accuracy was comparable (80.8%–83.3%). Moonshot-v1-128k showed superior sensitivity in population selection (0.92 versus ChatGPT-4o’s 0.55, P < 0.001) and led in F1 score for that item (0.80 versus ChatGPT-4o’s 0.67, P = 0.004). ChatGPT-4o demonstrated the highest consistency (mean κ = 96.5%), with perfect agreement (100%) in outcome confidence. ChatGPT-4o was also 97.3% faster per article (32.8 seconds versus 20 minutes manually) and outperformed Moonshot-v1-128k and DeepSeek-V3 by 47–50% in processing speed. The efficient and accurate assessment of ROB in cohort studies by ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 highlights the potential of LLMs to enhance the systematic review process.
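The agreement and accuracy metrics named in the abstract (sensitivity, specificity, F1, Cohen’s kappa, and PABAK) can be illustrated with a minimal Python sketch. This is not the authors’ code, and the labels below are hypothetical: `1` stands for a “high ROB” judgment and `0` for “low ROB” on one assessment item, with the expert consensus as the gold standard.

```python
# Minimal sketch of the binary-classification and agreement metrics
# described in the abstract. Hypothetical data; not the study's code.

def confusion(gold, pred):
    """Count the four cells of a 2x2 confusion matrix."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    return tp, tn, fp, fn

def metrics(gold, pred):
    tp, tn, fp, fn = confusion(gold, pred)
    n = tp + tn + fp + fn
    sens = tp / (tp + fn)                 # sensitivity (recall)
    spec = tn / (tn + fp)                 # specificity
    prec = tp / (tp + fp)                 # precision
    f1 = 2 * prec * sens / (prec + sens)  # harmonic mean of precision/recall
    po = (tp + tn) / n                    # observed agreement
    # Chance agreement from the marginal proportions of each rater.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (po - pe) / (1 - pe)          # Cohen's kappa
    pabak = 2 * po - 1                    # PABAK for two binary categories
    return {"sensitivity": sens, "specificity": spec, "f1": f1,
            "kappa": kappa, "pabak": pabak}

# Hypothetical item-level judgments for 10 cohort studies.
gold = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]   # expert consensus
pred = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]   # one LLM run
print(metrics(gold, pred))
```

PABAK replaces the chance-agreement term with the value expected when both categories are equally prevalent, which is why the abstract reports it alongside Cohen’s kappa: plain kappa can be deflated when one ROB judgment dominates the sample.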

Information

Type
Research Article
Creative Commons
CC BY-SA 4.0
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-ShareAlike licence (https://creativecommons.org/licenses/by-sa/4.0), which permits re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology
Figure 1 Flow diagram of the main study process.

Figure 2 Heatmap of accuracy assessment rates.

Table 1 Accuracy of assessments

Figure 3 Performance comparison of the assessment.

Table 2 Cohen’s kappa and PABAK comparison

Figure 4 Consistent assessment rate.

Supplementary material: File

Xia et al. supplementary material (File, 697.5 KB)