
A practical guide to evaluating sensitivity of literature search strings for systematic reviews using relative recall

Published online by Cambridge University Press:  07 March 2025

Malgorzata Lagisz*
Affiliation:
Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada Evolution & Ecology Research Centre and School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, NSW, Australia Theoretical Sciences Visiting Program, Okinawa Institute of Science and Technology Graduate University, Onna, Japan
Yefeng Yang
Affiliation:
Evolution & Ecology Research Centre and School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, NSW, Australia
Sarah Young
Affiliation:
Carnegie Mellon University, Pittsburgh, PA, USA
Shinichi Nakagawa*
Affiliation:
Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada Evolution & Ecology Research Centre and School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, NSW, Australia Theoretical Sciences Visiting Program, Okinawa Institute of Science and Technology Graduate University, Onna, Japan
*
Corresponding authors: Malgorzata Lagisz and Shinichi Nakagawa; Emails: losialagisz@gmail.com; snakagaw@ualberta.ca

Abstract

Systematic searches of published literature are a vital component of systematic reviews. When search strings are not “sensitive,” they may miss many relevant studies, limiting, or even biasing, the range of evidence available for synthesis. Concerningly, according to our survey of published systematic reviews and protocols, conducting and reporting evaluations (validations) of the sensitivity of the search strings used is rare. Potential reasons include a lack of familiarity with, or the inaccessibility of, complex sensitivity evaluation approaches. We first clarify the main concepts and principles of search string evaluation. We then present a simple procedure for estimating the relative recall of a search string, based on a pre-defined set of “benchmark” publications. The relative recall, that is, the sensitivity of the search string, is the retrieval overlap between the evaluated search string and a search string that captures only the benchmark publications. If there is little overlap (i.e., low recall or sensitivity), the evaluated search string should be improved to ensure that most of the relevant literature can be captured. The presented benchmarking approach can be applied to one or more online databases or search platforms. It is illustrated by five accessible, hands-on tutorials for commonly used online literature sources. Overall, our work provides an assessment of the current state of search string evaluations in published systematic reviews and protocols. It also paves the way to improved evaluation and reporting practices, making evidence synthesis more transparent and robust.
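The relative recall described in the abstract reduces to a simple set operation: the proportion of benchmark records that also appear among the records retrieved by the evaluated search string. A minimal sketch, assuming records are matched on some unique identifier such as a DOI (the function name and example identifiers are hypothetical, not from the article):

```python
def relative_recall(benchmark_ids, retrieved_ids):
    """Proportion of benchmark records captured by the evaluated search.

    benchmark_ids: identifiers (e.g., DOIs) of known-relevant records
                   indexed in the searched database.
    retrieved_ids: identifiers of all records returned by the search string.
    """
    benchmark = set(benchmark_ids)
    if not benchmark:
        raise ValueError("benchmark set must not be empty")
    captured = benchmark & set(retrieved_ids)  # overlap of the two sets
    return len(captured) / len(benchmark)

# Example with made-up identifiers: 2 of 4 benchmark records retrieved.
benchmark = ["10.1/aaa", "10.1/bbb", "10.1/ccc", "10.1/ddd"]
hits = ["10.1/aaa", "10.1/ccc", "10.1/xyz"]
print(relative_recall(benchmark, hits))  # 0.5
```

In practice, identifiers exported from different databases may need normalization (e.g., lowercasing DOIs) before set comparison; the sketch assumes clean, consistent identifiers.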

Information

Type
Tutorial
Creative Commons
CC BY-SA
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-ShareAlike licence (https://creativecommons.org/licenses/by-sa/4.0), which permits re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Figure 1 Conceptual representation of the body of evidence, database coverage, search string capture, and search string evaluation. (a) The vast body of evidence contains a certain unknown number of relevant studies, but only some are indexed in a given database. The true number of relevant bibliographic records in a given database is unknown. (b) The subset of records retrieved by a database search contains an unknown number of relevant and irrelevant bibliographic records until all retrieved records are assessed for relevance (screened). Then, search precision can be calculated as the proportion of relevant records among all retrieved records. Search precision and the total number of captured relevant records can also be estimated by screening a random sub-sample of records from all search hits. (c) Search evaluation (validation or benchmarking) can be performed using a predefined test set of relevant studies (benchmarking set). Search sensitivity is calculated as the proportion (or percentage) of indexed benchmark studies (bibliographic records) that are found by a search string.
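Panel (b) mentions estimating precision from a random sub-sample of search hits rather than screening every record. A minimal sketch of that estimation, assuming the screener's judgment is supplied as a relevance function (all names here are illustrative, not the authors' code):

```python
import random

def estimate_precision(all_hits, is_relevant, sample_size, seed=2025):
    """Estimate search precision by screening a random subsample of hits.

    all_hits:    list of retrieved records (any hashable representation).
    is_relevant: callable returning True if a record is judged relevant
                 (stands in for manual screening).
    sample_size: number of records to screen.
    Returns (estimated precision, estimated total relevant records).
    """
    rng = random.Random(seed)  # seeded for a reproducible sample
    sample = rng.sample(all_hits, min(sample_size, len(all_hits)))
    n_relevant = sum(1 for record in sample if is_relevant(record))
    precision = n_relevant / len(sample)
    # Scale the sample proportion up to the full hit set.
    return precision, precision * len(all_hits)

# Example: 1000 hits, where every 5th record happens to be relevant.
p, total = estimate_precision(list(range(1000)), lambda r: r % 5 == 0, 100)
```

The estimate carries sampling error, so in a real review the screened sample should be large enough for the confidence interval around the precision estimate to be acceptably narrow.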


Figure 2 Results of two surveys assessing reporting of search string development and evaluation in two representative literature samples from 2022: a sample of 100 published Cochrane protocols (Cochrane) and a cross-disciplinary sample of systematic reviews (other). Comparison between the two literature samples (Cochrane vs. other) for: (a) frequencies of providing a description of the process used for developing the final search string, (b) frequencies of providing a record of different search string variants tried during string development, (c) frequencies of reviewers reporting the use of a set of known relevant studies to discover relevant terms for the search string, (d) frequencies of mentioning that search string evaluation (validation, benchmarking, etc.) was performed, (e) frequencies of involving an information specialist in planning or performing the systematic review, (f) frequencies of providing the final search strings, and in how much detail. (g) Bar plot showing the most common search sources (databases, search platforms, or engines) used (or planned) for performing searches for the two literature samples (most systematic reviews used more than one search source, so proportions do not sum to 1). All detailed results and our analysis code are available in Supplementary File 2 and at https://github.com/mlagisz/method_benchmarking_survey.


Figure 3 Practical implementations of search sensitivity evaluation (benchmarking). (a) A simplified benchmarking workflow for a single database (search source). (b) Two alternative approaches to working with multiple databases (search sources): I—searches are evaluated separately for each database before being aggregated into an overall estimate; II—records (hits) retrieved by search strings in all databases are pooled together before evaluation is performed. “Steps” refer to steps 1–6 shown in panel (a).
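The two multi-database approaches in panel (b) can be sketched as set operations over per-database results. This is an illustrative reading of the caption, not the authors' implementation: approach I evaluates each database against only the benchmark records it indexes, while approach II pools all hits before computing a single recall (database names and the input structure are hypothetical):

```python
def per_database_recalls(benchmark, hits_by_db):
    """Approach I: relative recall per database, each restricted to the
    benchmark records actually indexed in that database.

    hits_by_db maps a database name to {"indexed": ids, "hits": ids}.
    """
    recalls = {}
    for db, info in hits_by_db.items():
        indexed_benchmark = set(benchmark) & set(info["indexed"])
        if indexed_benchmark:  # skip databases indexing no benchmark records
            found = indexed_benchmark & set(info["hits"])
            recalls[db] = len(found) / len(indexed_benchmark)
    return recalls

def pooled_recall(benchmark, hits_by_db):
    """Approach II: pool hits from all databases, then one overall recall."""
    pooled = set().union(*(set(info["hits"]) for info in hits_by_db.values()))
    return len(set(benchmark) & pooled) / len(set(benchmark))

# Example with made-up records: "db2" retrieves the records "db1" misses.
benchmark = {"a", "b", "c", "d"}
hits_by_db = {
    "db1": {"indexed": {"a", "b", "c"}, "hits": {"a", "b"}},
    "db2": {"indexed": {"a", "b", "c", "d"}, "hits": {"c", "d"}},
}
```

The example illustrates why the two approaches can disagree: each database alone has modest recall, but the pooled hits cover the whole benchmark set, so approach II reports perfect sensitivity.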

Supplementary material: File

Lagisz et al. supplementary material (File, 10.8 MB)