Using supervised machine learning classifiers to estimate likelihood of participating in clinical trials of a de-identified version of ResearchMatch

Published online by Cambridge University Press:  04 September 2020

Janette Vazquez
Affiliation:
Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
Samir Abdelrahman
Affiliation:
Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA; Computer Science Department, Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
Loretta M. Byrne
Affiliation:
Vanderbilt University, Nashville, TN, USA
Michael Russell
Affiliation:
Vanderbilt University, Nashville, TN, USA
Paul Harris
Affiliation:
Vanderbilt University, Nashville, TN, USA
Julio C. Facelli*
Affiliation:
Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA; Center for Clinical and Translational Science, University of Utah, Salt Lake City, UT, USA
* Address for correspondence: J. C. Facelli, PhD, Department of Biomedical Informatics, University of Utah, 421 Wakara Way, Suite #140, Salt Lake City, UT 84108, USA. Email: julio.facelli@utah.edu

Abstract

Introduction:

Lack of participation in clinical trials (CTs) is a major barrier to the evaluation of new pharmaceuticals and devices. Here we report an analysis of a dataset from ResearchMatch, an online clinical registry, using supervised machine learning approaches and a deep learning approach to discover characteristics of individuals who are more likely to show an interest in participating in CTs.

Methods:

We trained six supervised machine learning classifiers (Logistic Regression (LR), Decision Tree (DT), Gaussian Naïve Bayes (GNB), K-Nearest Neighbor Classifier (KNC), AdaBoost Classifier (ABC), and Random Forest Classifier (RFC)), as well as a deep learning method, a Convolutional Neural Network (CNN), using a dataset of 841,377 instances and 20 features, including demographic data, geographic constraints, medical conditions, and ResearchMatch visit history. The outcome variable was each participant's response (‘yes’ or ‘no’) indicating interest when presented with an invitation to a specific clinical trial opportunity. We also created four subsets of this dataset, based on top self-reported medical conditions and gender, and analysed each subset separately.
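As a rough illustration of this setup, the sketch below fits the six named classifier families with scikit-learn and scores each on a held-out split. It is not the authors' pipeline: the data are synthetic (generated with `make_classification` to mimic 20 features and a binary ‘yes’/‘no’ outcome), all hyperparameters are library defaults, the CNN is omitted, and the use of ROC AUC as the evaluation metric is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the registry data (assumption): 20 features,
# binary outcome where 1 plays the role of a 'yes' response.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The six classifier families named in the abstract, with default settings.
classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "GNB": GaussianNB(),
    "KNC": KNeighborsClassifier(),
    "ABC": AdaBoostClassifier(random_state=0),
    "RFC": RandomForestClassifier(random_state=0),
}

auc_by_model = {}
for name, model in classifiers.items():
    model.fit(X_train, y_train)
    # Probability assigned to the positive ('yes') class on held-out data.
    scores = model.predict_proba(X_test)[:, 1]
    auc_by_model[name] = roc_auc_score(y_test, scores)

# Report models from highest to lowest held-out AUC.
for name, auc in sorted(auc_by_model.items(), key=lambda kv: -kv[1]):
    print(f"{name}: AUC = {auc:.3f}")
```

The numbers printed by this sketch reflect only the synthetic data and say nothing about the ResearchMatch results.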

Results:

The deep learning model outperformed the machine learning classifiers, achieving an area under the curve (AUC) of 0.8105.
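For readers unfamiliar with the metric, and assuming AUC here refers to the area under the receiver operating characteristic curve, it can be read as the probability that a randomly chosen ‘yes’ responder receives a higher model score than a randomly chosen ‘no’ responder. A minimal sketch of that pairwise definition follows; the function name `roc_auc` and the toy labels and scores are illustrative only.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    example has the higher score; tied pairs count as half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = 'yes' response, 0 = 'no' response (made-up values).
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
print(f"AUC = {roc_auc(labels, scores):.4f}")
```

Under this reading, a value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two response groups.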

Conclusions:

The results provide sufficient evidence of meaningful correlations between the predictor variables and the outcome variable in the datasets analysed with the supervised machine learning classifiers. These approaches show promise for identifying individuals who may be more likely to participate when offered an opportunity to join a clinical trial.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Association for Clinical and Translational Science 2020

Fig. 1. Pipeline of method for analysis.

Table 1. Descriptive statistics of ResearchMatch dataset

Table 2. Standardized differences (SMD) and multicollinearity values for ResearchMatch dataset. Standardized differences are comparisons between ‘yes’ and ‘no’ responders

Table 3. Results for ResearchMatch dataset

Supplementary material: File

Vazquez et al. supplementary material