
Can machine learning help accelerate article screening for systematic reviews? Yes, when article separability in embedding space is high

Published online by Cambridge University Press:  10 March 2025

Farhan Ali*
Affiliation:
National Institute of Education, Nanyang Technological University, Singapore, Singapore
Amanda Swee-Ching Tan
Affiliation:
National Institute of Education, Nanyang Technological University, Singapore, Singapore
Serena Jun-Wei Wang
Affiliation:
College of Computing and Data Science, Nanyang Technological University, Singapore, Singapore
Corresponding author: Farhan Ali; Email: farhan.ali@nie.edu.sg

Abstract

Systematic reviews play an important role, but manual screening is time-consuming given the growing literature. There is a need to use and evaluate automated strategies to accelerate systematic reviews. Here, we comprehensively tested machine learning (ML) models from classical and deep learning model families. We also assessed the performance of prompt engineering via few-shot learning of the GPT-3.5 and GPT-4 large language models (LLMs). We further attempted to understand when ML models can help automate screening. These ML models were applied to actual datasets of systematic reviews in education. Results showed that the performance of classical and deep ML models varied widely across datasets, ranging from 1.2% to 75.6% of work saved at 95% recall. LLM prompt engineering produced similarly wide performance variation. We searched for indicators of whether and how ML screening can help, and discovered that the separability of clusters of relevant versus irrelevant articles in high-dimensional embedding space strongly predicts whether ML screening can help (overall R = 0.81). This simple and generalizable heuristic applied well across datasets and across ML model families. In conclusion, ML screening performance varies tremendously, but researchers and software developers can use our cluster-separability heuristic in various ways in an ML-assisted screening pipeline.
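The two quantities at the heart of the abstract can be computed concretely. Below is a minimal Python sketch, assuming the standard definition WSS@R = (TN + FN)/N − (1 − R) for work saved over sampling at recall R, and assuming the separability measure is a Davies–Bouldin-style score over embeddings (here scikit-learn's `davies_bouldin_score`, where lower means better-separated clusters). The variable names and toy data are illustrative, not taken from the authors' code.

```python
# Sketch: WSS@95% (work saved over sampling at 95% recall) and a
# Davies-Bouldin separability score on a toy embedding dataset.
import numpy as np
from sklearn.metrics import davies_bouldin_score

def wss_at_recall(y_true, scores, recall=0.95):
    """WSS@R: rank articles by model score, find the fraction that must be
    screened to capture `recall` of the relevant articles, and compare to
    random screening: WSS@R = (TN + FN) / N - (1 - R)."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float))  # highest score first
    ranked = y_true[order]
    needed = int(np.ceil(recall * ranked.sum()))          # relevant hits required
    cum = np.cumsum(ranked)
    n_screened = int(np.searchsorted(cum, needed)) + 1    # articles read to get there
    return (len(y_true) - n_screened) / len(y_true) - (1 - recall)

# Toy data: 50 relevant and 200 irrelevant articles as well-separated
# Gaussian clusters in an 8-dimensional "embedding" space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 8)),
               rng.normal(6.0, 1.0, (200, 8))])
y = np.array([1] * 50 + [0] * 200)
scores = -X[:, 0]  # hypothetical relevance scores favoring the relevant cluster

wss = wss_at_recall(y, scores)         # high: screening saves most of the work
dbs = davies_bouldin_score(X, y)       # low: clusters are well separated
```

On this toy dataset the well-separated clusters yield a high WSS@95% and a low Davies–Bouldin score, illustrating the paper's heuristic that high separability predicts useful ML screening.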

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Table 1 Details of systematic review datasets analyzed in the present study


Table 2 ML hyperparameters for non-CNN and CNN family of models


Figure 1 Performance metric (WSS @ 95%) for non-CNN models across all datasets. Refer to Section 2 and Supplementary Figure 1 for the computation of the metric. The higher the WSS @ 95%, the more work ML screening saves. Datasets are ordered by median performance. For classifiers: rf = random forest, sv = support vector machine, lo = logistic regression, nn = 2-layer neural network, base = LSTM base, pool = LSTM with pooling. For feature extraction: do = doc2vec, sb = sentence BERT, em = embedding-LSTM, tf = TF-IDF.


Figure 2 Performance metric (WSS @ 95%) for CNN models across all datasets. CNN models were switching models that initially used manual feature extraction with shallow classifiers before switching to CNNs. Datasets follow the ordering in Figure 1. Refer to Section 2 for details. Model labels before the switch are the same as in Figure 1.


Figure 3 Performance metric (Matthews correlation coefficient [MCC]) for LLM prompt engineering. Datasets follow the ordering in Figure 1.
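MCC is computed directly from predicted labels, which suits the LLM prompt-engineering setting where each article receives a hard relevant/irrelevant judgment rather than a ranking score. A minimal sketch with illustrative labels (not the paper's data), using scikit-learn's `matthews_corrcoef`:

```python
# Sketch: MCC on hard relevant (1) / irrelevant (0) judgments.
# MCC ranges from -1 to +1; 0 is chance-level classification.
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 0, 1]  # illustrative gold screening labels
y_pred = [1, 0, 0, 0, 1, 1]  # illustrative LLM judgments
mcc = matthews_corrcoef(y_true, y_pred)
```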


Figure 4 ML screening performance (WSS @ 95%) for non-CNN models as a function of the separability of relevant and irrelevant article clusters, as measured by DBS. Each dataset contributes 14 datapoints, one per combination of embedding and classifier hyperparameters (see Table 2). The non-CNN best-fit exponential curve is the main curve fitted to the datapoints; the other curves are force-fitted to the datapoints using parameters from the other model families (CNN and LLM) for comparison with the non-CNN best-fit curve. In general, parameters from the other model families fit the datapoints almost equally well, suggesting that an exponential relationship between ML screening performance and cluster separability holds regardless of model family.
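The best-fit exponential curve described here can be reproduced in outline with a least-squares fit. The functional form WSS = a·exp(−b·DBS) + c, the parameter names, and the synthetic datapoints below are assumptions for illustration; the paper's actual parameterization may differ.

```python
# Sketch: fit an exponential curve relating screening performance (WSS)
# to cluster separability (DBS), as in the Figure 4-6 best-fit curves.
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(dbs, a, b, c):
    # Assumed form: performance decays exponentially as DBS grows
    # (higher DBS = less separable clusters).
    return a * np.exp(-b * dbs) + c

# Synthetic (DBS, WSS) pairs standing in for the per-dataset datapoints.
rng = np.random.default_rng(1)
dbs = rng.uniform(0.5, 4.0, 40)
wss = exp_decay(dbs, 1.1, 0.9, 0.02) + rng.normal(0, 0.03, dbs.size)

(a, b, c), _ = curve_fit(exp_decay, dbs, wss, p0=(1.0, 1.0, 0.0))
```

Force-fitting another family's curve, as in the figure, would amount to evaluating `exp_decay` on these datapoints with parameters estimated from that family's data instead of refitting.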


Figure 5 ML screening performance (WSS @ 95%) for CNN models as a function of the separability of relevant and irrelevant article clusters, as measured by DBS. Each dataset contributes 12 datapoints, one per combination of starting embedding and classifier hyperparameters before the switch to CNN (see Table 2). The CNN best-fit exponential curve is the main curve fitted to the datapoints; the other curves are force-fitted to the datapoints using parameters from the other model families (non-CNN and LLM) for comparison with the CNN best-fit curve. In general, parameters from the other model families fit the datapoints almost equally well, suggesting that an exponential relationship between ML screening performance and cluster separability holds regardless of model family.


Figure 6 ML screening performance (MCC) for LLM prompt-engineering models as a function of the separability of relevant and irrelevant article clusters, as measured by DBS. Because results varied across LLM models, datapoints are labeled by LLM model rather than by dataset (unlike in Figures 4 and 5) to allow interpretation of LLM model trends; note, however, that model fitting was done on datapoints from LLM models applied to all datasets. The LLM best-fit exponential curve is the main curve fitted to the datapoints; the other curves are force-fitted to the datapoints using parameters from the other model families (non-CNN and CNN) for comparison with the LLM best-fit curve. In general, parameters from the other model families fit the datapoints almost equally well, suggesting that an exponential relationship between ML screening performance and cluster separability holds regardless of model family.

Supplementary material: File

Ali et al. supplementary material (389.7 KB)