Single-arm studies, particularly single-arm trials (SATs), are increasingly being used in submissions for marketing authorization and health technology assessment. As reviewers of evidence, we sought to better understand the validity of SATs, compared with observational single-arm studies (case series), and how to assess them in our reviews.
Methods
We conducted a pragmatic literature review to assemble a convenience sample of recent systematic reviews published from January to July 2023, in order to establish: (i) what single-arm study designs are included; (ii) what quality assessment tools are used; and (iii) whether there is a difference in effect size and variability among different study designs. A single reviewer identified reviews of interventions that included single-arm studies and extracted information on the numbers of included SATs and case series and the quality assessment tools used. Any misclassifications by review authors were identified. For meta-analyses, outcome data were extracted and a subgroup analysis comparing SATs and case series was conducted.
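To make the subgroup analysis concrete, the sketch below shows one common way to compare pooled effects between SATs and case series under a random-effects model. The effect sizes, variances, and choice of the DerSimonian-Laird estimator are illustrative assumptions, not the data or method used in the review.

```python
# Minimal sketch of a random-effects subgroup comparison (SATs vs case series).
# The effect sizes (yi) and variances (vi) below are hypothetical values.
import numpy as np
from scipy import stats

def pool_dl(yi, vi):
    """DerSimonian-Laird random-effects pooled estimate and its variance."""
    yi, vi = np.asarray(yi, float), np.asarray(vi, float)
    wi = 1.0 / vi                                  # fixed-effect weights
    y_fe = np.sum(wi * yi) / np.sum(wi)            # fixed-effect mean
    q = np.sum(wi * (yi - y_fe) ** 2)              # Cochran's Q
    c = np.sum(wi) - np.sum(wi ** 2) / np.sum(wi)
    tau2 = max(0.0, (q - (len(yi) - 1)) / c)       # between-study variance
    wi_re = 1.0 / (vi + tau2)                      # random-effects weights
    mu = np.sum(wi_re * yi) / np.sum(wi_re)
    return mu, 1.0 / np.sum(wi_re)

# Hypothetical log effect estimates and variances for each subgroup
sat_mu, sat_var = pool_dl([0.42, 0.35, 0.51], [0.02, 0.03, 0.04])
cs_mu, cs_var = pool_dl([0.30, 0.55, 0.25, 0.40], [0.05, 0.02, 0.06, 0.03])

# Test for a subgroup difference (Q_between on one degree of freedom)
q_between = (sat_mu - cs_mu) ** 2 / (sat_var + cs_var)
print(f"SATs {sat_mu:.2f} vs case series {cs_mu:.2f}, "
      f"p = {stats.chi2.sf(q_between, df=1):.3f}")
```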
Results
Work is still underway to complete this investigation. So far, it appears that a large proportion of systematic reviews misclassify SATs and case series studies and few use appropriate quality assessment tools. There is not yet any evidence of a systematic difference between SATs and case series in terms of effect size.
Conclusions
Findings suggest that there is poor understanding of SATs in the review community. There are limited specific quality assessment tools for SATs and review authors frequently use inappropriate tools to assess them. More research is likely to be needed to investigate the relative validity of SATs and single-arm observational studies.
Randomized controlled trials (RCTs), as the most internally rigorous design, are the gold standard for assessing the efficacy and safety profile of interventions. Increasingly, however, health technology assessment (HTA) considers evidence from non-randomized studies. Guidance recommends synthesizing different study designs separately because of their different inherent biases and limitations. When authors or reviewers misclassify studies, this can affect which studies are included in a synthesis and therefore have an impact on review results.
Methods
We are conducting a methods project to (i) identify a clear study design classification system, (ii) explore whether its use produces consistent study design categorizations among reviewers, and (iii) iteratively improve the classification system. We performed a pragmatic web-based search for study design categorization tools and used the resulting schemas to develop a clear algorithm for use by reviewers of all levels of experience, specifically in reviews of treatment interventions. Next, we tested tool consistency and user experience by web-based survey in a small internal sample of reviewers, each independently using the system to categorize 18 published studies.
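The abstract does not reproduce the classification algorithm itself; the sketch below only illustrates the general shape of a prompt-question decision chart for studies of treatment interventions. The questions and design labels are assumptions made for illustration, not the authors' published system.

```python
# Illustrative prompt-question decision chart for study design classification.
# The questions and labels are assumptions, not the authors' published algorithm.
def classify_design(investigator_allocated: bool,
                    randomized: bool,
                    comparison_group: bool,
                    selected_on_outcome: bool) -> str:
    if investigator_allocated:                      # exposure assigned by the study
        if randomized:
            return "Randomized controlled trial"
        return ("Non-randomized controlled trial" if comparison_group
                else "Single-arm trial")
    if selected_on_outcome:                         # participants sampled by outcome status
        return "Case-control study"
    return "Controlled cohort study" if comparison_group else "Case series"

# Example: investigator-assigned treatment, no randomization, with a control group
print(classify_design(True, False, True, False))    # -> Non-randomized controlled trial
```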
Results
A median of seven reviewers (range four to eight) categorized each study. Rater agreement using the classification chart varied widely, with 100 percent agreement on the designs of three studies (17 percent) and at least 75 percent of reviewers agreeing on one design for nine studies (50 percent). Agreement was most commonly reached on RCTs and non-randomized controlled trials. The most common sources of disagreement were between different types of cohort studies and between case series and controlled cohort studies, largely due to inconsistent reporting. We also identified several areas for improvement: the wording of prompt questions, the ordering of designs, and the addition of new elements.
Conclusions
The classification system as initially designed led to too much variation in study design categorization to be useful. Consequently, we present a revised version that we now aim to evaluate in a larger sample of reviewers. Further research will also investigate whether using the tool would change the results of systematic reviews, using a small sample of published reviews.
While systematic reviews (SRs) are regarded as the gold standard in healthcare evidence reviewing (and a requirement of many health technology assessments (HTAs)), other types of review also play an important role throughout a product’s lifecycle. Drawing on more than thirty years’ experience in conducting reviews, we present key points to consider when deciding which review type might be required.
Methods
SRs are recommended when a comprehensive search and synthesis approach is required, for example HTAs. They have highly structured methods, emphasizing bias minimization, transparency, and replicability. “Rapid,” “pragmatic,” or “targeted” reviews are increasingly popular due to their accelerated timelines and reduced costs, with methodological shortcuts possible at various stages. Scoping reviews explore what is known about a topic and typically have a broad research question. “Reviews of reviews” or “overviews” identify existing SRs on an established topic. Finally, “living reviews” follow the same process as an SR or rapid review but incorporate new evidence on a continual or regular basis.
Results
Rapid reviews may be appropriate when flexibility exists regarding the scope and review methods. Any limitations due to methodological shortcuts must be acknowledged in a transparent manner. Scoping reviews are useful for pioneering research ahead of an SR, or early in a product’s development phase, when an overall understanding of the evidence base is required. Reviews of reviews are particularly useful when the size of the primary study literature means that a review of primary studies would be unfeasible. Living reviews are best suited to topics where the evidence base is changing rapidly, or the best information is needed quickly.
Conclusions
When considering conducting or commissioning a review, organizations should consider the intended audience for the review, the resources, time, and budget available, and the size of the existing literature. Although SRs remain the gold standard, a rapid review, scoping review, or review of reviews may offer a more suitable way to approach a given research question.
Conducting a systematic review (SR) of clinical trials is labor-intensive and expensive. However, existing open-source content can be used to develop custom machine learning tools suited to the workflow of individual organizations. This case study details the potential of a bespoke tool developed by York Health Economics Consortium (YHEC) for reducing the time and cost involved in producing an SR.
Methods
RESbot is a flexible, stand-alone machine learning tool created using an extensively tested open-source dataset developed by Cochrane. The tool identifies randomized controlled trials (RCTs) from a large corpus of records. It has a user interface and inputs/outputs to fit into the company’s existing workflow at any stage. RESbot has two settings. The “sensitive” setting identifies a higher number of possible RCTs with a lower risk of missing eligible studies, while the “precise” setting is more focused. For both settings, we estimated the reduction in resources required for record screening in two examples of RCT-only reviews.
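The abstract does not describe RESbot’s internals, so the following is only a minimal sketch of how a probability-scoring RCT classifier with “sensitive” and “precise” thresholds could be applied to a record set; the model, training data, and threshold values are illustrative assumptions rather than the tool’s actual implementation.

```python
# Sketch of applying a two-setting RCT classifier to a record set.
# The classifier, thresholds, and data below are illustrative assumptions,
# not RESbot's actual model or settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: title/abstract text labelled RCT (1) or not (0)
train_texts = ["randomised double-blind placebo controlled trial of drug X",
               "retrospective case series of patients undergoing procedure Y"]
train_labels = [1, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(train_texts), train_labels)

THRESHOLDS = {"sensitive": 0.1,   # keep more records, lower risk of missing RCTs
              "precise": 0.5}     # keep fewer records, more focused

def screen(records, setting="sensitive"):
    """Return records whose predicted probability of being an RCT meets
    the threshold for the chosen setting."""
    probs = model.predict_proba(vectorizer.transform(records))[:, 1]
    cut = THRESHOLDS[setting]
    return [r for r, p in zip(records, probs) if p >= cut]
```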
Results
Scoping searches in MEDLINE were conducted for SRs of RCTs in femoropopliteal artery disease (FAD) and postpartum depression (PD). The results were run through RESbot. For the FAD SR, 1,444 references were retrieved, with the sensitive and precise RESbot settings reducing the record set by 38 percent and 64 percent, respectively. For the PD SR, a record set of 2,153 records was reduced by 25 percent and 41 percent, respectively. The resource savings offered by RESbot vary by subject but may reduce the time taken to screen records by up to 64 percent, with a corresponding reduction in cost to the organization commissioning the SR.
Conclusions
The use of bespoke machine learning tools in SR production has the potential to reduce the time and staff costs involved in producing a review. This case study tested the effect on a small number of records, but for larger reviews retrieving tens of thousands of records, reductions in time and costs can be very significant.
Systematic reviews are important for informing decision-making and primary research, but they can be time consuming and costly. With the advent of machine learning, there is an opportunity to accelerate the review process in study screening. We aimed to understand the literature to make decisions about the use of machine learning for screening in our review workflow.
Methods
We conducted a pragmatic literature review of PubMed to obtain studies evaluating the accuracy of publicly available machine learning screening tools. A single reviewer used ‘snowballing’ searches to identify studies reporting accuracy data and extracted the sensitivity (the ability to correctly identify the included studies for a review) and the specificity or workload saved (the ability to correctly exclude irrelevant studies).
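For clarity, the sketch below computes these metrics from per-record tool and reviewer decisions. The definitions follow standard screening-evaluation usage; the exact formulas (particularly for workload saved) may differ between the evaluations reviewed.

```python
# Hedged sketch of the extracted accuracy metrics; definitions are standard
# screening-evaluation usage and may not match every evaluation exactly.
def screening_metrics(tool_included, reviewer_included):
    """tool_included / reviewer_included: parallel lists of booleans per record."""
    pairs = list(zip(tool_included, reviewer_included))
    tp = sum(t and r for t, r in pairs)             # relevant records kept by the tool
    fn = sum((not t) and r for t, r in pairs)       # relevant records the tool excluded
    tn = sum((not t) and (not r) for t, r in pairs) # irrelevant records correctly excluded
    fp = sum(t and (not r) for t, r in pairs)       # irrelevant records kept by the tool
    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tn / (tn + fp) if tn + fp else None
    workload_saved = (tn + fn) / len(pairs)         # records not needing manual screening
    return sensitivity, specificity, workload_saved

# Toy example with five records
tool     = [True, True, False, False, False]
reviewer = [True, False, False, True, False]
print(screening_metrics(tool, reviewer))            # -> (0.5, 0.666..., 0.6)
```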
Results
Ten tools (AbstractR, ASReview Lab, Cochrane RCT classifier, Concept encoder, Dpedia, DistillerAI, Rayyan, Research Screener, Robot Analyst, SWIFT-active screener) were evaluated in a total of 16 studies. Fourteen studies were single-arm evaluations: tool performance was compared with a reference standard (predominantly single-reviewer screening), but there was no other comparator. Two studies were comparative, comparing tools with other tools as well as with a reference standard. All tools ranked records by probability of inclusion and either (i) applied a cut-point to exclude records or (ii) were used to rank and re-rank records during screening iterations, with screening continuing until most relevant records had been obtained. The accuracy of the tools varied widely between studies and review projects. When used in mode (ii), at 95 percent to 100 percent sensitivity, tools achieved workload savings of between 7 percent and 99 percent. It was unclear whether evaluations were conducted independently of tool developers.
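As a minimal sketch of the two usage modes just described, the functions below assume a generic scoring function returning an inclusion probability per record; the stopping rule, thresholds, and toy data are illustrative assumptions rather than any specific tool’s method.

```python
# Illustrative sketch of the two usage modes; the scoring function and
# stopping rule are stand-ins, not any specific tool's method.
def mode_i_cutpoint(records, score, cut=0.2):
    """(i) Exclude records whose predicted inclusion probability falls below a cut-point."""
    return [r for r in records if score(r) >= cut]

def mode_ii_rerank(records, score, human_screen, stop_after=50):
    """(ii) Repeatedly screen the top-ranked record, stopping after a run of
    consecutive irrelevant records (a simple stopping rule; tools vary)."""
    remaining = sorted(records, key=score, reverse=True)
    decisions, run_of_excludes = {}, 0
    while remaining and run_of_excludes < stop_after:
        record = remaining.pop(0)            # highest-scored unscreened record
        relevant = human_screen(record)      # manual include/exclude decision
        decisions[record] = relevant
        run_of_excludes = 0 if relevant else run_of_excludes + 1
        # A real tool would re-train its model on `decisions` and re-rank here.
    return decisions

# Toy usage with hypothetical scores and screening decisions
recs = ["rec1", "rec2", "rec3", "rec4"]
scores = {"rec1": 0.9, "rec2": 0.6, "rec3": 0.3, "rec4": 0.1}
kept = mode_i_cutpoint(recs, scores.get, cut=0.5)                       # ["rec1", "rec2"]
out = mode_ii_rerank(recs, scores.get, lambda r: r != "rec4", stop_after=1)
```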
Conclusions
Evaluations suggest that tools have the potential to correctly classify studies during screening. However, conclusions are limited because (i) tool accuracy is generally not compared with dual-reviewer screening and (ii) the literature lacks comparative studies, and between-study heterogeneity makes it impossible to robustly determine the accuracy of the tools relative to one another. Independent evaluations are needed.
We aimed to identify which international health technology assessment (HTA) agencies are undertaking evaluations of medical tests, to summarize commonalities and differences in methodological approach, and to highlight examples of good practice.
Methods
We conducted a methodological review incorporating: systematic identification of HTA guidance documents mentioning the evaluation of tests; identification of key contributing organizations and abstraction of their approaches to all essential HTA steps; a summary of similarities and differences between organizations; and identification of important emergent themes that define the current state of the art and the frontiers where further development is needed.
Results
Seven key organizations were identified from 216 screened. The main themes were: elucidation of claims of test benefits; attitude to direct and indirect evidence of clinical effectiveness (including evidence linkage); searching; quality assessment; and health economic evaluation. With the exception of the handling of test accuracy data, methods were largely based on general approaches to HTA, with few test-specific modifications. The biggest dissimilarities in approach were in the elucidation of test claims and in attitudes to direct and indirect evidence.
Conclusions
There is consensus on some aspects of HTA of tests, such as dealing with test accuracy, and examples of good practice which HTA organizations new to test evaluation can emulate. The focus on test accuracy contrasts with universal acknowledgment that it is not a sufficient evidence base for test evaluation. There are frontiers where methodological development is urgently required, notably integrating direct and indirect evidence and standardizing approaches to evidence linkage.