
Automated citation searching in systematic review production: A simulation study

Published online by Cambridge University Press:  07 March 2025

Darren Rajit
Affiliation:
Monash Centre for Health Research and Implementation, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, Victoria, Australia
Lan Du
Affiliation:
Department of Data Science and AI, Faculty of Information Technology, Monash University, Clayton, Victoria, Australia
Helena Teede
Affiliation:
Monash Centre for Health Research and Implementation, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, Victoria, Australia Monash Partners Academic Health Sciences Centre, Clayton, Victoria, Australia
Joanne Enticott*
Affiliation:
Monash Centre for Health Research and Implementation, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, Victoria, Australia Monash Partners Academic Health Sciences Centre, Clayton, Victoria, Australia
*
Corresponding author: Joanne Enticott; Email: joanne.enticott@monash.edu

Abstract

Bibliographic aggregators like OpenAlex and Semantic Scholar offer scope for automated citation searching within systematic review production, promising increased efficiency. This study aimed to evaluate the performance of automated citation searching compared to standard search strategies and to examine factors that influence performance. Automated citation searching was simulated on 27 systematic reviews across the OpenAlex and Semantic Scholar databases, spanning three study areas (health, environmental management, and social policy). Performance, measured by recall (proportion of relevant articles retrieved), precision (proportion of retrieved articles that are relevant), and F1–F3 scores (weighted harmonic means of recall and precision), was compared to the performance of the search strategies originally employed by each systematic review. The associations between systematic review study area, number of included articles, number of seed articles, seed article type, study type inclusion criteria, API choice, and performance were analyzed. Automated citation searching outperformed the reference standard in terms of precision (p < 0.05) and F1 score (p < 0.05) but failed to outperform it in terms of recall (p < 0.05) and F3 score (p < 0.05). Study area influenced the performance of automated citation searching, with performance higher in environmental management than in social policy. Given its inferior recall and F3 score, automated citation searching is best used as a supplementary search strategy in systematic review production, where recall is more important than precision. However, the observed outperformance in F1 score and precision suggests that automated citation searching could be helpful in contexts where precision is as important as recall.
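The performance measures named above (recall, precision, and the F-beta family, where F2 and F3 weight recall progressively more heavily than precision) follow standard definitions. As an illustrative sketch only, not the authors' code, and with hypothetical article identifiers, they can be computed over sets of retrieved and relevant articles like so:

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision: fraction of retrieved articles that are relevant.
    Recall: fraction of relevant articles that were retrieved."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall.
    beta = 1 weights them equally; beta = 3 weights recall much more."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical OpenAlex-style work IDs for illustration.
    retrieved = {"W1", "W2", "W3", "W4"}   # articles the search returned
    relevant = {"W2", "W3", "W5"}          # articles the review actually included
    p, r = precision_recall(retrieved, relevant)
    print(p, r)                 # 0.5, ~0.667
    print(f_beta(p, r, 1))      # F1 ≈ 0.571
    print(f_beta(p, r, 3))      # F3 ≈ 0.645 (recall-weighted)
```

A recall-weighted score like F3 reflects the systematic review setting, where missing a relevant article is costlier than screening an irrelevant one; this is why the study reports F1 through F3 rather than F1 alone.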

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Figure 1 Framework depicting high level methodology of the simulation study. Adapted from protocol (11).


Table 1 Inclusion and exclusion criteria for sample systematic reviews included in study (11)


Table 2 Performance measures employed in study (recall, precision, and F score)


Figure 2 Schematic depicting the automated citation searching simulation process.


Table 3 Median number of included articles (IQR) and average intracluster semantic similarity (±SD) for systematic reviews in each source database, and all reviews in the dataset


Table 4 Summary baseline characteristics of seed articles successfully retrieved from the OpenAlex and Semantic Scholar APIs


Table 5 Median (IQR) precision, F1 score, F2 score, and F3 score for all search strategies employed by the systematic reviews in the dataset


Figure 3 (A–D) Comparison of automated citation searching performance (best performing run) vs search strategies employed by the sample systematic reviews, by precision, F1 score, F2 score, and F3 score. Observations above the dotted line indicate out-performance of the automated method vs the reference standard.


Table 6 Performance (precision, F1 score, F2 score, and F3 score) of automated citation searching vs reference systematic review search strategies


Table 7 Median (IQR) recall, precision, F1 score, F2 score, and F3 score of the best performing automated citation searching runs, by systematic review subsets


Table 8 Median % (IQR) of included articles with valid IDs extracted from systematic reviews in the dataset, and baseline retrievability rate of included articles across both APIs (OpenAlex, Semantic Scholar)


Figure 4 Recall of automated citation searching for each systematic review against various levels of recall (A), and against the baseline retrievability rate of included articles of each systematic review (B).

Supplementary material: File

Rajit et al. supplementary material (File, 306.8 KB)