
Lessons and tips for designing a machine learning study using EHR data

Published online by Cambridge University Press: 24 July 2020

Jaron Arbet
Affiliation:
Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado-Denver Anschutz Medical Campus, Aurora, CO, USA
Cole Brokamp
Affiliation:
Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH, USA; Division of Biostatistics and Epidemiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA
Jareen Meinzen-Derr
Affiliation:
Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH, USA; Division of Biostatistics and Epidemiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA
Katy E. Trinkley
Affiliation:
Department of Clinical Pharmacy, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado, Aurora, CO, USA; Department of Medicine, School of Medicine, University of Colorado, Aurora, CO, USA
Heidi M. Spratt*
Affiliation:
Department of Preventive Medicine and Population Health, University of Texas Medical Branch, Galveston, TX, USA
*Address for correspondence: H.M. Spratt, PhD, Department of Preventive Medicine and Population Health, University of Texas Medical Branch, 301 University Blvd. Route 1148, Galveston, TX 77555-1148, USA. Email: hespratt@utmb.edu

Abstract

Machine learning (ML) provides the ability to examine massive datasets and uncover patterns within data without relying on a priori assumptions such as specific variable associations, linearity in relationships, or prespecified statistical interactions. However, the application of ML to healthcare data has been met with mixed results, especially when using administrative datasets such as the electronic health record. The black box nature of many ML algorithms contributes to an erroneous assumption that these algorithms can overcome major data issues inherent in large administrative healthcare data. As with other research endeavors, good data and analytic design are crucial to ML-based studies. In this paper, we provide an overview of common misconceptions about ML, the corresponding truths, and suggestions for incorporating these methods into healthcare research while maintaining a sound study design.

Information

Type
Review Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Association for Clinical and Translational Science 2020

Fig. 1. Illustration of the iterative machine learning process.


Table 1. Metrics for evaluating prediction accuracy for various types of outcomes


Table 2. General properties of different machine learning models (adapted from Kuhn [12] and Hastie et al. [2]): ✓ = good, = fair, × = poor