
A comprehensive systematic review dataset is a rich resource for training and evaluation of AI systems for title and abstract screening

Published online by Cambridge University Press:  07 March 2025

Gary C. K. Chan*
Affiliation:
School of Computing Technologies, RMIT University, Melbourne, VIC, Australia; National Centre for Youth Substance Use Research, University of Queensland, Brisbane, QLD, Australia
Estrid He
Affiliation:
School of Computing Technologies, RMIT University, Melbourne, VIC, Australia
Janni Leung
Affiliation:
National Centre for Youth Substance Use Research, University of Queensland, Brisbane, QLD, Australia
Karin Verspoor
Affiliation:
School of Computing Technologies, RMIT University, Melbourne, VIC, Australia; School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
*Corresponding author: Gary C. K. Chan; Email: c.chan4@uq.edu.au

Abstract

When conducting a systematic review, screening the vast body of literature to identify the small set of relevant studies is a labour-intensive and error-prone process. Although there is an increasing number of fully automated screening tools, their performance is suboptimal and varies substantially across review topic areas. Many of these tools are trained only on small datasets, and most are not tested across a wide range of review topic areas. This study presents two systematic review datasets compiled from more than 8,600 systematic reviews and more than 540,000 abstracts covering 51 research topic areas in health and medical research. These datasets are the largest of their kind to date. We demonstrate their utility in training and evaluating language models for title and abstract screening. Our dataset includes detailed metadata for each review, including title, background, objectives and selection criteria. We show that a small language model trained on this dataset with additional metadata achieves excellent performance, with an average recall above 95% and specificity over 70% across a wide range of review topic areas. Future research can build on our dataset to further improve the performance of fully automated tools for systematic review title and abstract screening.
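To illustrate the screening task the abstract describes, the following is a minimal baseline sketch, not the authors' model: a TF-IDF lexical classifier in which each abstract is paired with the review's selection criteria before classification, echoing the paper's use of review metadata as additional input. All data and names below are hypothetical toy examples.

```python
# Illustrative baseline for title/abstract screening (NOT the paper's model):
# TF-IDF features + logistic regression, with review metadata (selection
# criteria) concatenated to each abstract as extra context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical review metadata and candidate abstracts
criteria = "Include randomised trials of smoking cessation in adolescents."
abstracts = [
    "A randomised trial of a smoking cessation app for adolescents.",
    "A cohort study of alcohol use in older adults.",
    "Randomised controlled trial of vaping cessation among teenagers.",
    "Cross-sectional survey of diet quality in middle-aged women.",
]
labels = [1, 0, 1, 0]  # 1 = relevant to the review, 0 = irrelevant

# Prepend the review criteria so the classifier sees criteria + abstract jointly
docs = [criteria + " [SEP] " + a for a in abstracts]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(docs, labels)

# Score an unseen abstract for this review
new_doc = criteria + " [SEP] " + "Trial of a quit-smoking programme for teens."
prediction = model.predict([new_doc])[0]
```

In practice a fine-tuned small language model, as evaluated in the paper, would replace the TF-IDF pipeline, but the structure of the task (review metadata plus abstract in, include/exclude decision out) is the same.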

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology
Table 1 Descriptive statistics for the training set, validation set and the three test sets

Figure 1 Classification model (left) and relevance model (right).

Table 2 Model performance of the four models. False negatives are bold and false positives are underlined

Table 3 Review-level analysis of the relevance model

Table 4 Descriptive statistics of a small systematic review dataset created by simulating manual search

Figure 2 False-negative rate by reviews.

Figure 3 False-positive rate by reviews.

Supplementary material: Chan et al. supplementary material (File, 249.2 KB)