
Strategizing AI utilization for psychological literature screening: A comparative analysis of machine learning algorithms and key factors to consider

Published online by Cambridge University Press:  19 December 2025

Lars König*
Affiliation:
Department of Psychology, Helmut-Schmidt-University, Germany
Steffen Zitzmann
Affiliation:
Department of Psychology, MSH Medical School Hamburg, Germany
Martin Hecht
Affiliation:
Department of Psychology, Helmut-Schmidt-University, Germany
Corresponding author: Lars König; Email: lars.koenig@hsu-hh.de

Abstract

With the rapid growth of scholarly literature, efficient artificial intelligence (AI)–aided abstract screening tools are becoming increasingly important. This study evaluated 10 different machine learning (ML) algorithms used in AI-aided screening tools for ordering abstracts according to their estimated relevance. We focused on assessing their performance in terms of the number of abstracts that must be screened to achieve a sufficient detection rate of relevant articles. Our evaluation included articles screened with diverse inclusion and exclusion criteria. Crucially, we examined how characteristics of the screening data—such as the proportion of relevant articles, the overall frequency of abstracts, and the amount of training data—impacted algorithm effectiveness. Our findings provide valuable insights for researchers across disciplines, highlighting key factors to consider when selecting an ML algorithm and determining a stopping point for AI-aided screening. Specifically, we observed that the algorithm combining the logistic regression (LR) classifier with the sentence-bidirectional encoder representations from transformers (SBERT) feature extractor outperformed other algorithms, demonstrating both the highest efficiency and the lowest variability in performance. Nonetheless, the algorithm’s performance varied across experimental conditions. Building on these findings, we discuss the results and provide practical recommendations to assist users in the AI-aided screening process.

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open materials
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology
Figure 1 Active learning in the realm of AI-aided screening. Note: The human-in-the-loop approach, along with the metrics used to evaluate screening performance.
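The human-in-the-loop loop depicted in Figure 1 can be sketched in a few lines: after each human decision, the relevance model is retrained on all labels collected so far, and the highest-scoring unscreened abstract is surfaced next. The sketch below is illustrative only — it uses a toy keyword-overlap scorer as a stand-in for the ML classifiers evaluated in the article (e.g., LR + SBERT), and all function names are ours, not the authors' implementation.

```python
def screen_with_active_learning(abstracts, is_relevant, seed_labels, make_scorer):
    """Human-in-the-loop screening: after each decision, retrain the
    relevance scorer on all labels so far and surface the unscreened
    abstract it ranks highest. Returns the order in which the remaining
    abstracts get screened."""
    labeled = dict(seed_labels)  # index -> bool, from the prescreening phase
    order = []
    while len(labeled) < len(abstracts):
        score = make_scorer(abstracts, labeled)            # "retrain" the model
        pool = [i for i in range(len(abstracts)) if i not in labeled]
        nxt = max(pool, key=score)                         # most relevant next
        labeled[nxt] = is_relevant(nxt)                    # the human screens it
        order.append(nxt)
    return order

def keyword_scorer(abstracts, labeled):
    # Toy stand-in for a trained classifier: score an abstract by how many
    # of its words also appear in abstracts already labeled relevant.
    relevant_words = set()
    for i, lab in labeled.items():
        if lab:
            relevant_words.update(abstracts[i].lower().split())
    return lambda i: len(relevant_words & set(abstracts[i].lower().split()))

# Tiny demonstration: two prescreened abstracts seed the model, and the
# remaining relevant abstract (index 2) is surfaced before index 3.
abstracts = ["meta analysis of anxiety", "steel bridge design",
             "anxiety in university students", "bridge load testing"]
order = screen_with_active_learning(
    abstracts, is_relevant=lambda i: i in {0, 2},
    seed_labels={0: True, 1: False}, make_scorer=keyword_scorer)
# order == [2, 3]
```

A real tool would replace `keyword_scorer` with a classifier over text embeddings and retrain in batches rather than after every single decision, but the control flow is the same.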

Table 1 Descriptives of the original and artificially constructed abstract collections (Study 1)

Figure 2 Screening cost at 95% sensitivity by machine learning algorithm (Study 1). Note: The bars reflect the median performance, the whiskers represent the interquartile range, and the points represent the 90th percentile. For each ML algorithm, descriptive statistics are based on 84,000 artificial abstract collections. SC@95% represents the percentage of abstracts that needed to be screened to identify 95% of the relevant articles.
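The SC@95% metric defined in the caption can be computed directly from a ranked screening sequence: walk down the list in the model's order and stop once 95% of the relevant articles have been found. A minimal sketch, assuming relevance labels are known for the whole collection (the function name is ours, not from the article):

```python
import math

def screening_cost(ranked_labels, sensitivity=0.95):
    """Fraction of the ranked list that must be screened, top-down, to
    recover `sensitivity` of all relevant articles (1 = relevant,
    0 = irrelevant)."""
    target = math.ceil(sensitivity * sum(ranked_labels))  # hits required
    found = 0
    for n_screened, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= target:
            return n_screened / len(ranked_labels)
    raise ValueError("sensitivity target not reachable")

# 10 abstracts, 4 relevant; 95% sensitivity requires all 4 hits here,
# and the fourth relevant abstract sits at rank 8:
sc = screening_cost([1, 1, 0, 1, 0, 0, 0, 1, 0, 0])  # -> 0.8
```

Lower values mean a more efficient ranking; a random ordering yields an expected SC@95% close to the sensitivity level itself.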

Figure 3 Screening cost and false-positive rate at 95% sensitivity by prevalence (Study 1). Note: The bar plots reflect the median performance, the error bars represent the interquartile range, and the points represent the 90th percentile. Each summary statistic summarizes 180,000 observations. a) Performance in terms of Screening Cost (SC); b) Performance in terms of False Positive Rate (FPR) when identifying 95% of the relevant literature (@95%).

Figure 4 Screening cost and false-positive rate at 95% sensitivity by machine learning algorithm and prevalence (Study 1). Note: The bar plots reflect the median performance, the error bars represent the interquartile range, and the points reflect the 90th percentile. Each summary statistic summarizes 21,000 observations. a) Performance in terms of Screening Cost (SC); b) Performance in terms of False Positive Rate (FPR) when identifying 95% of the relevant literature (@95%).

Table 2 Screening cost at 95% sensitivity by ML algorithm and prevalence ratio (Study 1)

Figure 5 Screening cost at 95% sensitivity of the LR + SBERT algorithm for the main and interaction effects of prevalence and frequency (Study 2). Note: The bar plots reflect the median performance, the error bars represent the interquartile range, and the points represent the 90th percentile. Summary statistics for prevalence, frequency, and their interaction are based on 54,000, 81,000, and 27,000 observations, respectively. r.S. = relevant abstracts in the screening set.

Table 3 Screening cost at 95% sensitivity by main effects (Study 2)

Table 4 Screening cost at 95% sensitivity by sample size and prevalence (Study 2)

Figure 6 Screening cost at 95% sensitivity of the LR + SBERT algorithm for the main effect of training set and its interactions with prevalence and frequency (Study 2). Note: The bar plots reflect the median performance, the error bars represent the interquartile range, and the points represent the 90th percentile. Summary statistics are based on 54,000, 18,000, 27,000, and 9,000 observations for panels (a), (b), (c), and (d), respectively. r.S. = relevant abstracts in the screening set; r.T. = relevant abstracts in the training sets.

Table 5 Recommendations for the prescreening phase, stopping rule selection, and model setup

Supplementary material

König et al. supplementary material (File, 2.9 MB)