PD19 Machine Learning Modelling For Clinical Trial Design Using The National Institute for Health and Care Research Innovation Observatory’s ScanMedicine Database

Published online by Cambridge University Press:  23 December 2022

Abstract

Introduction

Clinical trials that fail prematurely due to poor design waste resources and deprive us of data for evaluating potentially effective interventions. This study used machine learning modelling to predict the success or failure of clinical trials and to understand which features drive that prediction. The features used for modelling were engineered from data held in the National Institute for Health and Care Research Innovation Observatory’s ScanMedicine database.

Methods

A large dataset containing 641,079 clinical trial records from 11 global clinical trial registries was extracted from ScanMedicine. Sixteen features were generated from fields relating to trial design and eligibility. Trials were labeled positive if they were completed (or target recruitment was achieved) and negative if they were terminated or withdrawn (or target recruitment was not achieved). To achieve optimal performance, phase-specific datasets were generated, and we focused on a subsample of Phase 2 trials (n=70,167). Ensemble models using bagging and boosting algorithms, including balanced random forest and extreme gradient boosting (XGBoost) classifiers, were trained and evaluated for predictive performance. Shapley Additive Explanations (SHAP) was used to explain the output of the best-performing model and to calculate feature contributions for individual studies.
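The labeling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the status strings, the `target` and `actual` enrolment arguments, and the choice to check trial status before recruitment are hypothetical, since the abstract does not specify the registry fields or the precedence between the two criteria.

```python
from typing import Optional

# Hypothetical status vocabularies; the actual registry values are not given in the abstract.
POSITIVE_STATUSES = {"completed"}
NEGATIVE_STATUSES = {"terminated", "withdrawn"}


def label_trial(status: str,
                target: Optional[int] = None,
                actual: Optional[int] = None) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None if the trial cannot be labeled."""
    normalized = status.strip().lower()
    if normalized in POSITIVE_STATUSES:
        return 1
    if normalized in NEGATIVE_STATUSES:
        return 0
    # Assumed fallback: compare actual enrolment with the target sample size.
    if target is not None and actual is not None and target > 0:
        return 1 if actual >= target else 0
    return None
```

Trials for which neither criterion can be evaluated return `None` here, so that they could be excluded from training rather than silently assigned to a class.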

Results

On the test set, the XGBoost model achieved a weighted F1-score of 0.88, an area under the receiver operating characteristic curve (ROC AUC) of 0.75, and a balanced accuracy of 0.75. This result shows that the model can distinguish between the two classes to predict whether a trial will succeed or fail, and can then report the features driving that prediction. The number of primary outcomes, whether the study was randomized, the target sample size, and the number of exclusion criteria were the most important features affecting the model’s prediction.
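For readers less familiar with the three reported metrics, the following dependency-free Python sketch shows how each is defined for a binary classifier. It is not the study's evaluation code; the function names are illustrative, and the AUC is computed with the pairwise (Mann-Whitney) formulation rather than by tracing the curve.

```python
from collections import Counter
from typing import Sequence


def _counts(y_true: Sequence[int], y_pred: Sequence[int], cls: int):
    """True positives, false positives and false negatives for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    score = 0.0
    for cls, n in support.items():
        tp, fp, fn = _counts(y_true, y_pred, cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / len(y_true)) * f1
    return score


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class recall."""
    support = Counter(y_true)
    recalls = [_counts(y_true, y_pred, cls)[0] / n for cls, n in support.items()]
    return sum(recalls) / len(recalls)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the weighted F1-score is dominated by the majority class while balanced accuracy gives each class equal weight, the former can sit well above the latter on imbalanced data. For instance, on an invented toy set of eight positives and two negatives where one negative is misclassified, these functions give a weighted F1 of about 0.89 and a balanced accuracy of 0.75.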

Conclusions

This study is the first to use predictive modelling on a large sample of clinical trial data obtained from 11 international trial registries. The predictive performance achieved by our novel approach, which uses models trained on phase-specific data, outperforms previous modelling in this space.

Type
Poster Debate
Copyright
© The Author(s), 2022. Published by Cambridge University Press