Highlights
What is already known?
Citation searching is increasingly recommended as a complementary strategy to database searches in systematic reviews (SRs). However, existing tools for citation searching are often limited by manual workflows, a lack of iterative capabilities, or opaque ranking mechanisms. To date, few studies have quantitatively assessed the performance of automated citation searching tools across a large and diverse set of reviews.
What is new?
We present BibliZap, an open-source, fully automated citation searching engine capable of multi-level forward and backward citation searching with ranked output. Using a curated set of 66 published SRs, in which the set of included articles was manually extracted as a retrospective gold standard, we show that citation searching via BibliZap substantially improves sensitivity, recovering additional relevant studies compared with PubMed-only searches. We also compare performance across distinct search configurations and highlight the contribution of ranked prioritization for rapid reviews.
Potential impact for journal readers
This study provides a robust, reproducible evaluation framework for citation searching tools and introduces a transparent and auditable solution for improving recall in evidence synthesis. BibliZap’s design and performance suggest that it may serve not only as a complement to traditional database queries but also as a standalone triage tool in exploratory or rapid reviews, offering a practical contribution to the methodology of evidence synthesis.
1 Introduction
The exponential growth of the scientific literature has transformed the way systematic reviews (SRs) are conducted. MEDLINE alone now indexes more than 35 million citations, with a publication volume doubling approximately every 17 years.Reference Bornmann, Haunschild and Mutz 1 As evidence-based medicine relies increasingly on comprehensive syntheses, SR teams must navigate a research landscape in which both information abundance and retrieval complexity have become structural challenges.Reference Taylor 2 – Reference McGowan and Sampson 6
The default approach to identifying relevant studies still relies on keyword-based queries in bibliographic databases such as PubMed.Reference De Groote, Shultz and Blecic 7 , Reference Cooper, Booth, Varley-Campbell, Britten and Garside 8 While indispensable, this strategy is known to suffer from sensitivity limitations in fields where articles are inconsistently indexed or terminology is heterogeneous and evolving.Reference Linder, Kamath, Pratt, Saraykar and Volk 9 Even well-constructed queries often fail to retrieve all eligible records, prompting recommendations to supplement database screening with citation searching (i.e., identifying articles that cite or are cited by known relevant studies).Reference Cooper, Booth, Varley-Campbell, Britten and Garside 8 , Reference Hirt, Nordhausen, Appenzeller-Herzog and Ewald 10
Citation searching, often referred to as snowballing or citation mining/tracking, has become a recognized component of systematic search strategies.Reference Hirt, Nordhausen, Appenzeller-Herzog and Ewald 10 – Reference Cooper, Lovell, Husk, Booth and Garside 14 Forward and backward citation exploration can uncover relevant records that elude traditional database queries due to vocabulary mismatch or indexing lag.Reference Greenhalgh and Peacock 15 , Reference Booth 16 In a recent scoping review, Hirt et al. found that citation searching adds unique records in 96% of SRs in which it is used; however, as these were methodological studies focusing on cases where citation searching was applied, this figure likely overestimates its yield in unselected reviews.Reference Hirt, Nordhausen, Appenzeller-Herzog and Ewald 10 Based on this growing body of evidence, the recently published Terminology, Application, and Reporting of Citation Searching (TARCiS) statement outlines ten recommendations for the application and reporting of citation searching, including the possibility of conducting it over several rounds, whereby references identified at one step are used as new starting points for further exploration, and emphasizes the need for transparent documentation.Reference Hirt, Nordhausen, Fuerst, Ewald and Appenzeller-Herzog 11
Despite this consensus, citation searching remains underused and inconsistently applied.Reference Briscoe, Bethel and Rogers 17 Its uptake is limited by the high time and resource burden required for a process whose yield is often heterogeneous and difficult to anticipate, due to incomplete database coverage, delayed indexing, and limited functionality for multi-seed citation export.Reference Cooper, Booth, Britten and Garside 12 , Reference Haddaway, Grainger and Gray 18 To date, no free and fully automated tool has been available that enables multi-level citation searching with ranking and batch output across multiple seeds.Reference Hirt, Nordhausen, Appenzeller-Herzog and Ewald 10 , Reference Haddaway, Grainger and Gray 18 More advanced platforms, such as Web of Science or Scopus, restrict automation or rely on proprietary relevance scores that are opaque and difficult to audit.Reference Gusenbauer and Haddaway 5 In practice, these limitations severely hinder the integration of citation searching into time-sensitive or resource-limited SR workflows.Reference Briscoe, Bethel and Rogers 17
To address the limitations of existing citation searching tools, we developed BibliZap, an open-source, fully automated engine that performs multi-level forward and backward citation searching and returns a ranked list of articles based on transparent relevance criteria. In this exploratory study, we evaluate BibliZap’s added value as a complementary tool for medical and health-related SRs. Using a retrospective gold standard derived from a previously selected corpus of published SRs,Reference Leblanc, Hamroun, Bentegeac, Le Guellec, Lenain and Chazard 19 we assess its ability to identify relevant studies that were missed by PubMed-only searches. We examine its performance when applied at various time points in the review workflow, from early-stage integration (after identifying a few relevant articles) to post-database screening. We hypothesize that BibliZap, by operationalizing key recommendations from recent citation-searching guidelines,Reference Hirt, Nordhausen, Fuerst, Ewald and Appenzeller-Herzog 11 can substantially improve recall while maintaining a manageable screening workload.
2 Methods
2.1 Description of the BibliZap tool
BibliZap is an open-source, fully automated citation searching engine designed to support systematic, bidirectional citation searching with multi-level expansion. The tool is implemented in Rust, a modern, memory-safe programming language optimized for performance and reliability, enabling efficient processing of large citation networks. It leverages the Lens platform (https://www.lens.org), which aggregates citation data from CrossRef, OpenAlex, PubMed, PubMed Central, and CORE.Reference Visser, van Eck and Waltman 20 , Reference Martín-Martín, Thelwall, Orduna-Malea and Delgado López-Cózar 21 As of early 2025, the Lens indexes over 272 million scholarly records, offering broad multidisciplinary coverage. In this study, only PubMed-indexed records accessible through the Lens were evaluated. Each run generates a reproducible log of all parameters, consistent with PRISMA-S reporting standards. BibliZap can be deployed through a web interface, an R package, or a command-line interface, depending on user preference. The codebase and documentation are publicly available on GitHub (https://github.com/BibliZap/biblizap-server).
To initiate a query, users input one or several identifiers (PMID, DOI, or Lens ID) as seed references. BibliZap then performs a recursive traversal of the citation graph, retrieving articles that cite or are cited by the input set (forward and backward citation searchingReference Lefebvre, Glanville and Briscoe 13 ). In practice, BibliZap enables a multi-level expansion of the citation graph, in which all records retrieved at a given level are reused as seeds for the next traversal step.Reference Lefebvre, Glanville and Briscoe 13 Each additional level increases the number of API requests and retrieved records by roughly two orders of magnitude, which makes the default depth of 2 a pragmatic trade-off between completeness and computational feasibility. Expansion up to depth 3 is technically supported, though not recommended for routine use. Each article in the corpus receives a ‘relevance score’, defined as the sum of its occurrences across all citation paths:

$$\text{score} = \sum_{i=1}^{d} A_i$$

where $A_i$ is the number of times the article appears at depth $i$, and $d$ is the selected depth (Figure 2). When the search is limited to a single direction—either forward or backward—depth-2 expansion retrieves specific subsets of the citation network. A forward-only configuration identifies papers citing those that already cite the seed articles, while a backward-only configuration retrieves papers cited by those already referenced by the seeds. When both directions are selected, these outputs naturally encompass co-cited and co-citing records, corresponding to indirect relationships within the second-level graph expansion.
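The multi-level expansion and occurrence-based scoring described above can be sketched as follows. This is an illustrative Python sketch, not BibliZap's actual Rust implementation; `get_citing` and `get_cited` are hypothetical stand-ins for calls to a citation API such as the Lens:

```python
from collections import Counter

def expand_and_score(seeds, depth, get_citing, get_cited):
    """Bidirectional multi-level citation expansion with occurrence scoring.

    `get_citing(id)` / `get_cited(id)` are hypothetical helpers that each
    return a list of article identifiers. An article's relevance score is
    the number of times it appears across all citation paths, summed over
    every expansion level.
    """
    score = Counter()
    frontier = list(seeds)
    for _ in range(depth):
        next_frontier = []
        for article in frontier:
            # forward (citing) + backward (cited) neighbours
            neighbours = get_citing(article) + get_cited(article)
            next_frontier.extend(neighbours)
            score.update(neighbours)  # each appearance adds 1 to the score
        frontier = next_frontier
    return score.most_common()  # ranked output, highest score first
```

With a depth of 2 and both directions enabled, this traversal naturally reaches co-cited and co-citing records, as described above.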

Figure 1 BibliZap web application input interface. User interface for launching a BibliZap query from the public web app (https://biblizap.org). Users input one or several seed identifiers (PMIDs, DOIs, or Lens IDs) separated by spaces, select the citation depth, direction (backward, forward, or both), and the number of results to retrieve. The application is powered by The Lens and includes a Firefox plugin for quick access from article pages.
2.2 Gold-standard corpus
To evaluate BibliZap’s performance, we used a curated set of SRs previously assembled by our team.Reference Leblanc, Hamroun, Bentegeac, Le Guellec, Lenain and Chazard 19 These reviews were published between 2012 and 2021 in six high-impact general medicine journals (The New England Journal of Medicine, The Lancet, JAMA, BMJ, Annals of Internal Medicine, and Nature Reviews Disease Primers). Eligible SRs had to (i) provide a reproducible PubMed search query, (ii) be fully accessible on PubMed Central, and (iii) explicitly list the studies included in the final evidence synthesis.
For each SR, the gold standard was defined as the set of included articles contributing to the review’s results or meta-analyses. This set was manually extracted by four reviewers (VL, RB, BLG, and AH) from the main text and supplementary materials. References used for background discussion, general epidemiology, methodological justifications, or reporting standards (e.g., PRISMA, GRADE) were excluded. Only studies with a PubMed ID were retained, enabling direct comparison between PubMed queries and BibliZap outputs. The proportion of excluded articles without a PMID was below 5%.
2.3 Search strategies
We evaluated five search strategies to identify the set of included studies (gold standard). Approaches 1–3 were designed to assess citation searching as a complement to full SR workflows, while Approaches 4 and 5 simulated rapid or focused searching scenarios using early stopping based on PubMed’s Best Match ranking.Reference Fiorini, Canese and Starchenko 22 Best Match is PubMed’s default relevance-based algorithm, which ranks records using a machine learning model that prioritizes term frequency, citation patterns, and recency.
• Approach 1 (PubMed): Full manual screening of all records retrieved by the original PubMed query (no BibliZap).
• Approach 2 (PubMed + BibliZap top 500): After full PubMed screening, BibliZap was seeded with all gold-standard articles retrieved by PubMed, and only the 500 highest-ranked BibliZap results were screened.
• Approach 3 (PubMed + BibliZap): After full PubMed screening, BibliZap was seeded with all gold-standard articles retrieved by PubMed, and the complete ranked output was screened in full.
• Approach 4 (PubMed early stop + BibliZap): PubMed results were screened in Best Match order, and the process was stopped after identifying the first gold-standard article. This record was used as the sole seed for BibliZap, whose output was then screened in full.
• Approach 5 (PubMed late stop + BibliZap): PubMed results were screened in Best Match order until three gold-standard articles were identified, which were then used jointly as seeds for BibliZap, whose output was then screened in full.
In all BibliZap-related configurations (Approaches 2–5), the following parameters were standardized: bidirectional search, depth of 2 levels, and a maximum output size of 10,000 articles, consistent with the retrieval limit imposed by the PubMed search engine. For each run, the SR itself was excluded from the citation graph to avoid circular citation chaining that would unfairly inflate BibliZap’s performance. Additionally, to match the original review’s publication context, a time filter was applied to restrict results to articles published during the SR’s publication window.
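The seed-selection rule shared by the early- and late-stop configurations can be illustrated with a small helper. This is a sketch of the evaluation logic, not the study's analysis code; identifiers and the gold-standard set are assumed to be supplied by an evaluation harness:

```python
def seeds_by_stopping_rule(bestmatch_order, gold_standard, k):
    """Screen PubMed records in Best Match order and stop once the first
    k gold-standard articles have been found; these become the BibliZap
    seeds. k=1 corresponds to Approach 4 (early stop) and k=3 to
    Approach 5 (late stop). Returns (seeds, number of records screened).
    """
    seeds = []
    for n_screened, pmid in enumerate(bestmatch_order, start=1):
        if pmid in gold_standard:
            seeds.append(pmid)
            if len(seeds) == k:
                return seeds, n_screened
    # Fewer than k gold-standard hits in the entire retrieved list.
    return seeds, len(bestmatch_order)
```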
2.4 Metrics and statistical analysis
The primary outcome was sensitivity, defined as the number of gold-standard articles retrieved by each approach divided by the total number of articles in the gold-standard set. For each strategy, we estimated (i) the sensitivity across all included reviews, (ii) the proportion of reviews reaching predefined sensitivity thresholds (≥80%, ≥90%, and 100%), (iii) the number of articles screened, and (iv) precision, defined as the proportion of gold-standard articles among all screened records. In the methodological literature, 90% is commonly cited as a targeted recall level in SRs (although acceptable thresholds remain context-dependent),Reference Heinen, Goossen, Lunny, Hirt, Puljak and Pieper 23 , Reference Sampson, Zhang and Morrison 24 while 80% is a pragmatic threshold for exploratory or rapid review contexts.Reference Bramer, Rethlefsen, Kleijnen and Franco 25 All performance metrics were summarized as mean values with 95% confidence intervals estimated by bootstrap resampling. Cumulative recall curves (with 95% confidence intervals) were plotted to represent the proportion of retrieved gold-standard articles as a function of the number of articles screened. PubMed results were analyzed in Best Match order; BibliZap results were ranked by descending relevance score.
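Under these definitions, the per-review metrics and their bootstrap confidence intervals can be computed as in the following minimal sketch (an illustration of the definitions above, not the study's analysis code):

```python
import random

def sensitivity(screened, gold):
    """Share of gold-standard articles retrieved by a strategy."""
    return len(set(screened) & set(gold)) / len(gold)

def precision(screened, gold):
    """Share of screened records that belong to the gold standard."""
    return len(set(screened) & set(gold)) / len(screened)

def cumulative_recall(ranked, gold):
    """Recall after screening the first 1, 2, ..., n ranked records."""
    gold, found, curve = set(gold), 0, []
    for record in ranked:
        found += record in gold
        curve.append(found / len(gold))
    return curve

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-review metric values."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]
```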
All statistical analyses were performed using R (version 4.2.0) and Python. Data wrangling and visualization were performed using the dplyr and ggplot2 packages (R).
3 Results
3.1 BibliZap interface and workflow
BibliZap can be accessed via a user-friendly web interface (https://biblizap.org, Figure 1), through an R package for local deployment, or via a dedicated Firefox browser extension. Users can perform a multi-seed query by simply separating each identifier (PMID, DOI, Lens ID, or a mix) with a space (Figure 1). Results are returned as a structured table sorted by decreasing score, and include the following fields: DOI, Title, Journal, First Author, Year Published, Summary, Number of Citations, and Relevance Score (Figure 3). A companion log file details the parameters used in each run, ensuring full reproducibility and alignment with PRISMA-S reporting standards.Reference Rethlefsen, Kirtley and Waffenschmidt 26 Export options include both spreadsheet (.xlsx) and RIS formats, ensuring compatibility with a wide range of reference management software. Users can choose to export up to 10,000 ranked records or the entire retrieved corpus at any configured search depth.

Figure 2 Overview of BibliZap’s citation searching process. Starting from one or more seed articles (bottom left), BibliZap performs forward (blue arrow: articles citing the seed) and backward (orange arrow: articles cited by the seed) citation exploration. The first expansion level (Depth 1) retrieves all directly linked articles. These are then used as new seeds to perform a second round of forward and backward citation searching (Depth 2), further expanding the network. In the figure, articles are color-coded to allow visual identification of their propagation through the network. The final output includes all unique articles retrieved across both expansion levels. Each article is assigned a relevance score, defined by the number of times it appears across all citation paths. As illustrated in the right panel, articles that appear most frequently receive the highest scores and are ranked accordingly, enabling transparent prioritization for screening.

Figure 3 Example of BibliZap output interface. Screenshot of the web application output table after running a citation query. Retrieved articles are ranked by a cumulative relevance score and presented with their metadata, including DOI, title, journal, first author, publication year, summary, citation count, and score. Users can filter, sort, and download results for further screening or integration into reference management tools.
3.2 Overall sensitivity and screening burden
Among the 70 SRs initially included, 66 were retained for analysis. In the remaining four cases, the original PubMed search query provided by the authors retrieved no results when re-executed in the PubMed search engine. The final set of 66 reviews represented a total corpus of 2,675 unique included references.
Using PubMed only (Approach 1), the mean sensitivity was 75% (95% CI 68–81) with a median of 1,889 articles screened (range 27–10,000) and a mean precision of 2.1% (95% CI 1.5–2.7). With the addition of BibliZap restricted to the top 500 ranked results (Approach 2), sensitivity increased to 91% (95% CI 88–94) with a median of 2,388 articles screened (range 527–10,500) and a mean precision of 1.5% (95% CI 1.1–1.9). When the full-ranked BibliZap output was screened (Approach 3), mean sensitivity reached 97% (95% CI 95–98) with a median of 8,422 articles (range 2,110–19,168) and a mean precision of 0.4% (95% CI 0.3–0.5). The early-stop configuration, where PubMed screening was halted after identifying the first relevant article (Approach 4), achieved a sensitivity of 75% (95% CI 68–81) with a median of 6,126 articles screened (range 0–9,999) and a mean precision of 0.5% (95% CI 0.4–0.7). Finally, the late-stop configuration based on the first three PubMed-identified articles (Approach 5) reached 90% (95% CI 86–93) with a median of 7,198 articles (range 3,305–9,623) and a mean precision of 0.5% (95% CI 0.4–0.6).
In the full BibliZap configuration (Approach 3), the number of additional articles retrieved and screened from BibliZap after PubMed ranged from 2,083 to 9,168 (median: 6,450). Among these, the number of additional gold-standard articles recovered ranged from 1 to 78 per review (median: 3). When restricted to the top 500 ranked outputs, citation searching with BibliZap retrieved a median of two additional included articles per review (range: 0–55). Detailed distributions of recall, precision, and screening volume for each approach—and for each of the 66 SRs individually—are presented in Supplementary Table 1.
Among the 30 reviews for which PubMed retrieved fewer than 90% of the included articles, supplementation with the full BibliZap output increased recall above this threshold in 22 cases. Similarly, of the 18 reviews with sensitivity below 80% using PubMed alone, 17 surpassed this threshold after applying full BibliZap citation searching. These review-level distributions of recall, stratified by approach and threshold, are illustrated by horizontal stacked bars (Figure 4). At the aggregated level, a sensitivity of ≥80% was reached in 52% of reviews using PubMed only, in 89% of reviews using the PubMed + top 500 BibliZap-ranked articles (Approach 2), and in 97% using the PubMed + full BibliZap output (Approach 3). The proportions of reviews reaching ≥90% sensitivity were 39%, 67%, and 92%, respectively. Complete recall (100%) was achieved in 18% of reviews with PubMed alone, 39% with top-500 BibliZap supplementation, and 58% with full BibliZap output screening (Figure 5).

Figure 4 Proportion of relevant articles retrieved per systematic review, according to the three main citation searching configurations. Each horizontal bar represents one review (N = 66). Stacked segments indicate the proportion of included (gold-standard) articles retrieved using PubMed only (Approach 1, dark blue), PubMed supplemented with the top 500 BibliZap results (Approach 2, grey), and PubMed supplemented with the full BibliZap output (Approach 3, black). Bars are ordered by overall sensitivity within each review.

Figure 5 Proportion of systematic reviews (N = 66) meeting predefined sensitivity thresholds (≥80%, ≥90%, and 100%) across the five search strategies. Bars represent mean proportions with 95% confidence intervals. Approaches include: PubMed only (Approach 1, dark blue), PubMed + BibliZap top 500 (Approach 2, grey), PubMed + BibliZap full output (Approach 3, black), PubMed early stop + BibliZap (Approach 4, light blue), and PubMed late stop + BibliZap (Approach 5, teal).
3.3 Cumulative recall
At the aggregated level, sensitivity reached 75% (95% CI: 73.3%–76.6%) after full screening of PubMed records and increased to 90% (95% CI: 88.8%–91.1%) after screening 3,000 additional articles ranked by BibliZap (Figure 6). The curve gradually plateaued near complete recall at around 9,000 BibliZap-ranked articles screened, reaching 96.6% sensitivity (95% CI: 94.3%–97.6%). Focusing on the subset of gold-standard articles missed by PubMed but recovered through BibliZap, 56% were found within the top 500 ranked results (95% CI: 52.3%–59.6%), 66% within the top 1,000 (95% CI: 61.8%–68.8%), 73% within 2,000 (95% CI: 69.9%–76.5%), and 84% within the top 6,000 (95% CI: 80.9%–86.3%). These cumulative detection rates for PubMed-missed articles are also shown in Figure 6 (inset).

Figure 6 Cumulative recall curve for the PubMed + full-seed BibliZap strategy (Approach 3). Cumulative sensitivity is plotted as a function of the number of articles screened, in descending relevance order (Best Match for PubMed; BibliZap relevance score for citation searching). The transition between PubMed (dark blue) and BibliZap (black) corresponds to the point where PubMed screening ends and citation searching begins. Right panel: among gold-standard articles not retrieved by PubMed (Approach 1), the cumulative proportion recovered by BibliZap (Approach 3) is plotted against the number of ranked articles screened. Shaded areas represent 95% confidence intervals.
4 Discussion
This study assessed the effectiveness of BibliZap, an open-source citation searching engine, in enhancing the comprehensiveness of SR searches beyond conventional PubMed queries. Our findings demonstrate that supplementing standard PubMed searches with multi-level citation searching substantially improves recall, increasing average sensitivity from 75% to 97%. BibliZap proved particularly effective at identifying relevant articles missed by PubMed searches, retrieving approximately 73% of these missed records within the first 2,000 ranked results. The prioritization and ranked outputs provided by BibliZap thus enabled an efficient recovery of crucial articles while maintaining a manageable screening workload. These results underscore the practical value of automated citation searching methods and highlight BibliZap’s potential to operationalize current methodological recommendations for systematic literature searching.
These findings align with and extend prior evidence showing that citation searching can enhance recall when supplementing database searches, here demonstrated specifically in the context of PubMed-only queries.Reference Hirt, Nordhausen, Appenzeller-Herzog and Ewald 10 , Reference Greenhalgh and Peacock 15 , Reference Booth 16 , Reference Gusenbauer 27 Only a limited number of studies have quantified this added value. Haddaway et al. (2022), in an evaluation of the Citationchaser tool across ten SRs, reported that backward citation searching alone recovered 14% of included records that had not been identified through traditional database queries.Reference Haddaway, Grainger and Gray 18 In contrast, Wright et al. (2014), examining forward citation chasing across 40 Cochrane reviews, found that only one additional included study was identified, underscoring the variable yield of citation-based strategies depending on topic and method.Reference Wright, Golder and Rodriguez-Lopez 28 Beyond such isolated estimates, Hirt et al. (2023) provided a scoping synthesis of 47 reviews that used citation chasing and concluded that backward or forward strategies yielded unique records in over 90% of cases.Reference Hirt, Nordhausen, Appenzeller-Herzog and Ewald 10 Similarly, Frandsen and Eriksen (2023) showed that supplementary search techniques, including citation searching, consistently retrieved additional eligible studies in qualitative SRs, reinforcing the importance of systematically integrating these methods rather than treating them as optional add-ons.Reference Frandsen and Eriksen 29 The diversity and heterogeneity of citation coverage across databases were further highlighted by Gusenbauer (2024), who compared 59 citation indexes and documented substantial differences between platforms such as Lens.org, OpenAlex, Scopus, and Web of Science.Reference Gusenbauer 27 In this context, our results add to the evidence base by providing a large-scale, quantitative evaluation of an open-source tool—BibliZap—implementing a fully automated, multi-seed, multi-level citation-searching process with ranked outputs built on the Lens citation graph.
Among currently available tools, Citationchaser is the most comparable open-source solution.Reference Haddaway, Grainger and Gray 18 It supports backward and forward citation retrieval and allows multiple inputs, but is limited to one level of expansion and does not provide ranking or relevance-based prioritization. While useful for manual citation exploration, its output must be reviewed in full, which may be impractical for large citation neighborhoods. Other tools, such as CoCites, recommend related papers based on co-citation and bibliographic coupling but rely on a proprietary algorithm and do not generate reproducible, parameterized outputs.Reference Janssens, Gwinn, Brockman, Powell and Goodman 30 Scite provides unique features by identifying citation intent (e.g., supporting or contrasting), but it is not designed for high-recall evidence retrieval.Reference Nicholson, Mordaunt and Lopez 31 Citation Cloud, integrated with the Anne O’Tate interface for PubMed, offers graphical access to an article’s citation neighborhood, including co-cited and bibliographically coupled works.Reference Smalheiser, Schneider, Torvik, Fragnito and Tirk 32 However, it is centered on a single article, does not support multi-seed workflows, and lacks batch screening functionality. Other recent tools, such as SpiderCite and Paperfetcher, further illustrate the diversity of available solutions.Reference Gusenbauer 27 Both perform single-level citation expansion and lack ranking or parameterized output. Paperfetcher allows either forward or backward exploration through a lightweight interface, whereas SpiderCite focuses on one-level citation retrieval. In contrast, BibliZap was designed to align with several key recommendations of the TARCiS statement.Reference Hirt, Nordhausen, Fuerst, Ewald and Appenzeller-Herzog 11 Its recursive traversal corresponds to multi-level expansion of the citation network rather than screening-based iteration. 
This design choice allows comprehensive graph exploration while maintaining a transparent and parameterized structure. It supports bidirectional citation searching (Recommendation 2), accepts multiple seed records (Recommendation 4), and allows multi-level expansion (Recommendation 8). While BibliZap does not yet implement automated de-duplication as outlined in Recommendation 7, this functionality is under active development. The tool’s open-source architecture contributes to its transparency and long-term sustainability, consistent with Recommendation 10.
Moreover, although prioritization of results is not yet formally included in the TARCiS recommendations, it has been identified as a key research priority.Reference Hirt, Nordhausen, Fuerst, Ewald and Appenzeller-Herzog 11 BibliZap’s ranking algorithm directly contributes to this agenda by supporting early stopping strategies, which may be particularly valuable in time-constrained or resource-limited contexts.Reference Hirt, Nordhausen, Appenzeller-Herzog and Ewald 10 , Reference Haddaway, Grainger and Gray 18 In our study, more than half of the relevant studies missed by PubMed were retrieved within the top 500 BibliZap-ranked articles, and over 70% within the first 2,000, highlighting the system’s ability to concentrate relevant literature at the top of the output. This capacity to prioritize high-yield records early in the screening process echoes prior methodological reflections that acknowledge the potential of standalone citation searching outside the traditional SR framework, particularly in scoping reviews or exploratory syntheses.Reference Hirt, Nordhausen, Appenzeller-Herzog and Ewald 10 Such approaches are especially relevant when conventional keyword-based retrieval may fail to identify all pertinent studies. Our results further show that a lightweight configuration (triggering BibliZap after the identification of only three relevant articles in PubMed) yielded a mean sensitivity of 90%, albeit at the cost of a substantially increased screening workload. This nevertheless supports the idea that citation searching, when supported by a ranking algorithm, can offer flexible entry points into the literature, depending on user needs and available resources.
This perspective aligns with earlier empirical work by Belter (2016), who demonstrated through empirical analysis of eight bibliometric reviews that forward citation searching alone retrieved between 56% and 88% of relevant studies, with a median recall of 72%, and with his subsequent study (2017), which introduced a citation-ranking algorithm and achieved up to 87% recall across 23 SRs.Reference Belter 33 , Reference Belter 34 His work highlights the potential of citation networks to serve as primary data sources in structured but non-systematic evidence syntheses. BibliZap’s ability to deliver multi-level, ranked, and transparent outputs from multiple seeds thus makes it particularly suited for such alternative applications, offering a flexible, auditable, and efficient citation searching solution beyond its role in formal SRs.
This exploratory study has several limitations. First, it was conducted retrospectively using published SRs. Although these reviews spanned diverse medical topics and were selected from high-impact journals, prospective validation within real-time screening workflows would provide a more robust assessment of usability in practice. Second, we defined the gold standard as the set of articles included in each review. These inclusion sets may be incomplete or imprecise, particularly when the reporting of methods or supplementary materials was suboptimal. Nonetheless, all gold-standard articles were manually verified by independent reviewers to ensure fidelity, and we excluded records not indexed in PubMed to allow consistent comparison across tools. Third, our evaluation was limited to the PubMed component of each review, meaning that ‘missed’ refers specifically to records not retrieved by PubMed. While this does not capture the multi-database strategies used in most SRs, fewer than 5% of included studies in our corpus lacked a PubMed ID, suggesting limited impact on overall findings. This restriction does not constrain the applicability of BibliZap, which can be seeded from any identifier indexed by Lens.org. Fourth, BibliZap currently performs direct citation searching and, at depth-2 expansion, inherently retrieves co-cited and co-citing records. However, it does not yet implement dedicated co-citation or bibliographic-coupling algorithms as standalone search strategies,Reference Boyack and Klavans 35 which could represent a future enhancement of its functionality. Although this design choice promotes transparency by ensuring that each connection is grounded in an explicit citation link, it may limit the discovery of conceptually related but citation-distant literature.
At present, the actual impact of co-citation and related methods on systematic retrieval remains poorly evaluated, and their theoretical appeal requires further empirical validation.Reference Hirt, Nordhausen, Appenzeller-Herzog and Ewald 10 Citation-based approaches, including BibliZap, may also underrepresent very recent or sparsely cited studies, and their ranked outputs cannot mitigate publication bias—a limitation previously emphasized by Belter,Reference Belter 33 who noted that certain study types are rarely identifiable through citation relationships alone. However, because BibliZap’s ranking integrates both forward and backward links rather than raw citation counts, it can still highlight recent or methodologically comprehensive articles that cite relevant prior work, partially attenuating this limitation. Moreover, we did not formally compare depth-1 and depth-2 configurations. While depth 2 was selected as a pragmatic default balancing recall and feasibility, future work should directly assess the marginal benefit and workload implications of single- versus two-level expansion using BibliZap. Finally, our evaluation focused exclusively on sensitivity and screening burden; we did not assess how the use of BibliZap might influence the overall conclusions or strength of evidence in the resulting SRs.
Its compatibility with scripting environments and screening software makes BibliZap particularly well-suited for integration into semi-automated evidence synthesis pipelines. Beyond traditional use cases, its structured and ranked outputs provide a compelling foundation for downstream coupling with large language models (LLMs) trained to support citation screening.Reference Guo, Gupta, Deng, Park, Paget and Naugler 36 , Reference Khraisha, Put, Kappenberg, Warraitch and Hadfield 37 Recent findings by Tran et al. suggest that LLMs can assist in automating early phases of study selection by classifying abstracts with reasonable reliability.Reference Tran, Gartlehner and Yaacoub 38 In this context, BibliZap could act as an upstream filtering mechanism, delivering pre-prioritized corpora that reduce the burden on LLM triage models and improve overall pipeline efficiency. Such synergy is not only a matter of time savings; it may also improve accuracy. Citation screening is a known bottleneck in SRs, where human reviewers have been shown to miss up to 11% of relevant studies during initial selection.Reference Waffenschmidt, Knelangen, Sieben, Bühn and Pieper 39 , Reference O’Hearn, MacDonald and Tsampalieros 40 By combining BibliZap’s iterative citation searching with LLM-based classification, future systems could mitigate human error while accelerating throughput, particularly in rapid or living reviews, where screening steps must be repeated regularly.
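The upstream-filtering idea described above can be sketched as a short script. This is a minimal illustration only, not BibliZap's actual output format or API: the `RankedRecord` fields, the `top_k` cutoff, and the `triage` placeholder (standing in for an LLM classifier) are all hypothetical assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class RankedRecord:
    pmid: str    # hypothetical identifier field for the example
    title: str
    score: int   # illustrative combined forward/backward link count

def top_k(records, k):
    """Keep the k highest-scoring records as a pre-prioritized corpus."""
    return sorted(records, key=lambda r: r.score, reverse=True)[:k]

def triage(record):
    """Placeholder for an LLM-based classifier; a real pipeline would
    call a model here rather than this toy keyword rule."""
    return "include" if "randomized" in record.title.lower() else "screen"

# Toy ranked output standing in for a citation-search export.
corpus = [
    RankedRecord("111", "A randomized trial of X", 14),
    RankedRecord("222", "Editorial on X", 2),
    RankedRecord("333", "Cohort study of X", 9),
]
shortlist = top_k(corpus, 2)                      # ranked prioritization step
decisions = {r.pmid: triage(r) for r in shortlist}  # downstream triage step
```

The point of the sketch is the two-stage shape: ranked prioritization first shrinks the corpus, so the downstream classifier only sees the records most likely to be relevant.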
Further developments could also explore user-defined weighting schemes within the ranking algorithm (e.g., emphasizing recency, citation context, or specific metadata fields), advanced filtering capabilities (e.g., excluding editorials or protocols), or native integration with review management platforms such as Covidence or Rayyan.Reference Kellermeyer, Harnke and Knight 41 Coupled with transparent logging of parameters and open access to its underlying codebase, these advances would support the development of fully reproducible, auditable, and scalable workflows for modern evidence synthesis.
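A user-defined weighting scheme of the kind envisaged above could take a simple linear form. The following sketch is purely illustrative and does not reflect BibliZap's implemented ranking: the field names (`cited_by_seed`, `cites_seed`, `year`), the linear ten-year recency decay, and the weight values are all assumptions invented for the example.

```python
def weighted_score(record, weights):
    """Combine backward links, forward links, and recency into one score,
    with user-defined weights for each component (illustrative only)."""
    # Linear decay: full credit for the current year, none after 10 years.
    recency = max(0.0, 1.0 - (2025 - record["year"]) / 10.0)
    return (weights["backward"] * record["cited_by_seed"]
            + weights["forward"] * record["cites_seed"]
            + weights["recency"] * recency)

# Toy records standing in for a citation-search result set.
records = [
    {"pmid": "111", "year": 2024, "cited_by_seed": 3, "cites_seed": 1},
    {"pmid": "222", "year": 2012, "cited_by_seed": 5, "cites_seed": 0},
]

# Emphasizing recency lets a well-connected recent paper outrank an
# older record with more raw links.
weights = {"backward": 1.0, "forward": 1.0, "recency": 2.0}
ranked = sorted(records, key=lambda r: weighted_score(r, weights),
                reverse=True)
```

Exposing `weights` as user input would let reviewers tune the ranking toward recency-sensitive questions (e.g., living reviews) or toward raw connectivity, while keeping the scoring rule explicit and auditable.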
To conclude, BibliZap is a transparent, configurable, and automated citation searching tool that improves sensitivity in SRs conducted on restricted database searches (here, PubMed). Its feasibility is best demonstrated in rapid or resource-limited review contexts, particularly when screening is limited to the top-ranked outputs. BibliZap operationalizes key recommendations from the TARCiS statement and offers practical value for both exhaustive and exploratory evidence retrieval tasks.
Acknowledgments
We gratefully acknowledge The Lens (https://www.lens.org) for providing free API access, which enables the functionality of BibliZap’s web-based application. Their support has been instrumental in facilitating open and reproducible citation searching research. This manuscript was prepared with the assistance of a generative AI language model (ChatGPT, OpenAI) used exclusively for improving clarity and style; all intellectual content, data interpretation, and final editing were performed by the authors.
Author contributions
RB: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Supervision, Validation, Visualization, Writing—original draft, and Writing—review and editing; BLG: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Supervision, Validation, Visualization, and Writing—review and editing; VL: Data curation, Investigation, Resources, Software, Validation, and Writing—review and editing; RL: Conceptualization, Validation, Writing—review and editing; LD: Investigation, Methodology, Supervision, Validation, and Writing—review and editing; VG: Methodology, Supervision, Validation, Writing—review and editing; EG: Methodology, Validation, and Writing—review and editing; EC: Conceptualization, Methodology, Supervision, Validation, and Writing—review and editing; PA: Conceptualization, Methodology, Supervision, Validation, and Writing—review and editing; EA: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Supervision, Validation, Visualization, Writing—original draft, and Writing—review and editing; AH: Conceptualization, Data curation, Formal analysis, Methodology, Supervision, Validation, Visualization, Writing—original draft, and Writing—review and editing.
Competing interest statement
The authors declare none.
Data availability statement
The dataset of 66 SRs used in this study is available from the corresponding author upon reasonable request. This includes the original PubMed search queries, manually curated gold-standard article lists, and extracted metadata used for performance evaluation. In addition, all source code and documentation required to run BibliZap are freely available on GitHub (https://github.com/BibliZap/biblizap-server), including: (i) the R package and command-line interface; (ii) instructions for replicating all analyses and figures from this paper; (iii) example seed inputs and configuration files; (iv) tools for exporting and formatting results; and (v) the complete backend and frontend code. This ensures full reproducibility and transparency of our evaluation pipeline.
Funding statement
This research did not receive any specific funding from agencies or organizations in the public, commercial, or not-for-profit sectors.
Ethics statement
This study did not include research involving human participants, personal data, or identifiable information. All analyses were based on publicly available bibliographic metadata from published scientific literature. Therefore, no institutional review board (IRB) approval was required.
Supplementary material
To view supplementary material for this article, please visit http://doi.org/10.1017/rsm.2026.10079.