Hostname: page-component-77f85d65b8-5ngxj Total loading time: 0 Render date: 2026-03-28T01:43:01.043Z Has data issue: false hasContentIssue false

Automating the data extraction process for systematic reviews using GPT-4o and o3

Published online by Cambridge University Press:  17 September 2025

Yuki Kataoka*
Affiliation:
Department of Internal Medicine, Kyoto Min-iren Asukai Hospital, Kyoto, Japan Scientific Research WorkS Peer Support Group (SRWS-PSG), Osaka, Japan Department of Healthcare Epidemiology, Kyoto University Graduate School of Medicine/School of Public Health, Kyoto, Japan Department of International and Community Oral Health, Tohoku University Graduate School of Dentistry, Sendai, Japan
Tomohiro Takayama
Affiliation:
Faculty of Medicine, Kyoto University, Kyoto, Japan Fitting Cloud Inc., Kyoto, Japan
Keisuke Yoshimura
Affiliation:
Faculty of Medicine, Kyoto University, Kyoto, Japan
Ryuhei So
Affiliation:
Scientific Research WorkS Peer Support Group (SRWS-PSG), Osaka, Japan Department of Psychiatry, Okayama Psychiatric Medical Center , Okayama, Japan CureApp, Inc., Tokyo, Japan
Yasushi Tsujimoto
Affiliation:
Scientific Research WorkS Peer Support Group (SRWS-PSG), Osaka, Japan Oku Medical Clinic, Osaka, Japan Department of Health Promotion and Human Behavior, Kyoto University Graduate School of Medicine/School of Public Health, Kyoto University, Kyoto, Japan
Yosuke Yamagishi
Affiliation:
Division of Radiology and Biomedical Engineering, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
Shiro Takagi
Affiliation:
Independent Researcher
Yuki Furukawa
Affiliation:
Department of Neuropsychiatry, University of Tokyo, Tokyo, Japan
Masatsugu Sakata
Affiliation:
Department of Health Promotion and Human Behavior, Kyoto University Graduate School of Medicine/School of Public Health, Kyoto University, Kyoto, Japan Department of Neurodevelopmental Disorders, Nagoya City University Graduate School of Medical Sciences, Nagoya, Japan
Đorđe Bašić
Affiliation:
Faculty of Behavioural and Movement Sciences, Clinical Psychology, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Andrea Cipriani
Affiliation:
Department of Psychiatry, University of Oxford, Oxford, UK Oxford Precision Psychiatry Lab, NIHR Oxford Health Biomedical Research Centre, Oxford, UK Oxford Health National Health Service Foundation Trust, Warneford Hospital, Oxford, UK
Pim Cuijpers
Affiliation:
Department of Clinical, Neuro- and Developmental Psychology, WHO Collaborating Center for Research and Dissemination of Psychological Interventions, Amsterdam Public Health Institute, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Eirini Karyotaki
Affiliation:
Department of Clinical, Neuro- and Developmental Psychology, WHO Collaborating Center for Research and Dissemination of Psychological Interventions, Amsterdam Public Health Institute, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Mathias Harrer
Affiliation:
Department of Clinical, Neuro- and Developmental Psychology, WHO Collaborating Center for Research and Dissemination of Psychological Interventions, Amsterdam Public Health Institute, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands Section for Evidence-Based Medicine in Psychiatry and Psychotherapy, Department of Psychiatry and Psychotherapy, School of Medicine and Health, Technical University of Munich, Munich, Germany
Stefan Leucht
Affiliation:
Section for Evidence-Based Medicine in Psychiatry and Psychotherapy, Department of Psychiatry and Psychotherapy, School of Medicine and Health, Technical University of Munich, Munich, Germany
Ava Homiar
Affiliation:
Department of Psychiatry, University of Oxford, Oxford, UK Oxford Precision Psychiatry Lab, NIHR Oxford Health Biomedical Research Centre, Oxford, UK
Edoardo G. Ostinelli
Affiliation:
Department of Psychiatry, University of Oxford, Oxford, UK Oxford Precision Psychiatry Lab, NIHR Oxford Health Biomedical Research Centre, Oxford, UK Oxford Health National Health Service Foundation Trust, Warneford Hospital, Oxford, UK
Clara Miguel
Affiliation:
Department of Clinical, Neuro- and Developmental Psychology, WHO Collaborating Center for Research and Dissemination of Psychological Interventions, Amsterdam Public Health Institute, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Alessandro Rodolico
Affiliation:
Section for Evidence-Based Medicine in Psychiatry and Psychotherapy, Department of Psychiatry and Psychotherapy, School of Medicine and Health, Technical University of Munich, Munich, Germany
Toshi A. Furukawa
Affiliation:
Kyoto University Office of Institutional Advancement and Communications, Kyoto, Japan
*
Corresponding author: Yuki Kataoka, Email: youkiti@gmail.com
Rights & Permissions [Opens in a new window]

Abstract

Large language models have shown promise for automating data extraction (DE) in systematic reviews (SRs), but most existing approaches require manual interaction. We developed an open-source system using GPT-4o to automatically extract data with no human intervention during the extraction process. We developed the system on a dataset of 290 randomized controlled trials (RCTs) from a published SR about cognitive behavioral therapy for insomnia. We evaluated the system on two other datasets: 5 RCTs from an updated search for the same review and 10 RCTs used in a separate published study that had also evaluated automated DE. We developed the best approach across all variables in the development dataset using GPT-4o. The performance in the updated-search dataset using o3 was 74.9% sensitivity, 76.7% specificity, 75.7 precision, 93.5% variable detection comprehensiveness, and 75.3% accuracy. In both datasets, accuracy was higher for string variables (e.g., country, study design, drug names, and outcome definitions) compared with numeric variables. In the third external validation dataset, GPT-4o showed a lower performance with a mean accuracy of 84.4% compared with the previous study. However, by adjusting our DE method, while maintaining the same prompting technique, we achieved a mean accuracy of 96.3%, which was comparable to the previous manual extraction study. Our system shows potential for assisting the DE of string variables alongside a human reviewer. However, it cannot yet replace humans for numeric DE. Further evaluation across diverse review contexts is needed to establish broader applicability.

Information

Type
Research Article
Creative Commons
Creative Common License - CCCreative Common License - BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology
Figure 0

Figure 1 Study process overview. The flowchart depicted in this figure illustrates the overall process of development and external validation of our automated data extraction system. Meta-prompt: a set of instructions given to the large language model (LLM) to instruct it to perform a specific task. Prompting: the process of providing a meta-prompt and input to an LLM to retrieve a desired output. For detailed explanations of individual methods and techniques, please refer to the corresponding sections in Section 2.

Figure 1

Figure 2 Schematic representation of GPT-4o-based data extraction (DE) process for systematic reviews. This figure illustrates the input provided to GPT-4o and the corresponding response in the context of RCT DE. The input section shows a meta-prompt containing instructions for GPT-4o, along with specifications for the output style, including variable names accompanied by their descriptions, as well as sample RCT data encompassing text, tables, and figures. The response section demonstrates the structured output format that GPT-4o uses to present the extracted data.

Figure 2

Figure 3 Three prompting techniques to optimize meta-prompts. This figure illustrates three different methods for optimizing meta-prompts, including variable descriptions using GPT-4o. Each method starts with the first variable descriptions as input and processes RCT data differently to generate optimized meta-prompts. The contextual chat and one-by-one methods iterate through RCTs individually, whereas the conventional method processes all RCT data at once.

Figure 3

Figure 4 Data extraction evaluation metrics. This figure illustrates the metrics used to evaluate the system. LLM, large language model.

Figure 4

Figure 5 Development of the first meta-prompt (variable description). This figure outlines the process for creating the first variable descriptions for data extraction (DE) in systematic reviews (SRs). The inputs provided to GPT-4o included a meta-prompt with specific instructions for an SR, SR-level data, and a DE manual. The output is an array of objects in a JavaScript object notation (JSON) structure containing variables and their detailed descriptions, generated entirely by GPT-4o based on these inputs. JSON is a simple, structured data format commonly used for text analysis. We adopt a JSON format due to its high representation capacity.

Figure 5

Figure 6 Flowchart of the number of RCTs, arms, and variables examined across different training methods. This figure details the breakdown of RCTs, arms, and variables used in the evaluation of three prompting techniques in 10-fold cross-validation. The top section shows the initial sampling of RCTs for training and evaluation. The middle section details the reasons for excluding various RCTs from the initial sample. API errors occurred when processing articles with many pages, leading to incomplete text extraction. GPT-4o sometimes misidentified the number of trial arms, creating data mismatches. Some trials included in the overall dataset did not undergo data extraction for meta-analysis. The bottom tables present the final counts of RCTs, arms, and variables used in the analysis for each training scenario across the three methods.

Figure 6

Table 1 Performance of three prompting techniques to optimize numeric variable descriptions

Figure 7

Table 2 Performance of the chat-5-RCT method with modifications in Dataset 1

Figure 8

Figure 7 Sensitivity for all variables by the chat-5-RCT method with modifications in Dataset 1. Square: mean. Horizontal line: standard deviation. Some variables lack data points because a human reviewer extracted all relevant information, leaving no examples of “missing” data to calculate specificity against.

Figure 9

Figure 8 Specificity for all variables by the chat-5-RCT method with modifications in Dataset 1. Square: mean. Horizontal line: standard deviation. Some variables lack data points because a human reviewer extracted all relevant information, leaving no examples of “missing” data to calculate specificity against.

Figure 10

Figure 9 Precision for all variables by the chat-5-RCT method with modifications in Dataset 1. Square: mean. Horizontal line: standard deviation. Some variables lack data points because a human reviewer extracted all relevant information, leaving no examples of “missing” data to calculate specificity against.

Figure 11

Figure 10 Variable detection comprehensiveness for all variables by the chat-5-RCT method with modifications in Dataset 1. Square: mean. Horizontal line: standard deviation. Some variables lack data points because a human reviewer extracted all relevant information, leaving no examples of “missing” data to calculate specificity against.

Figure 12

Figure 11 Accuracy for all variables by the chat-5-RCT method with modifications in Dataset 1. Square: mean. Horizontal line: standard deviation.

Figure 13

Table 3 Comparison of prompting techniques for data extraction in Dataset 1

Figure 14

Table 4 All-in-one data extraction method in Dataset 2

Figure 15

Table 5 Comparison of the data extraction method in Dataset 3

Supplementary material: File

Kataoka et al. supplementary material

Kataoka et al. supplementary material
Download Kataoka et al. supplementary material(File)
File 267.6 KB