
Identification of newborns at risk for autism using electronic medical records and machine learning

Published online by Cambridge University Press:  26 February 2020

Rayees Rahman
Affiliation:
Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Arad Kodesh
Affiliation:
Department of Mental Health, Meuhedet Health Services, Tel Aviv, Israel; Department of Community Health, University of Haifa, Haifa, Israel
Stephen Z. Levine
Affiliation:
Department of Community Health, University of Haifa, Haifa, Israel
Sven Sandin
Affiliation:
Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Seaver Center for Autism Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Abraham Reichenberg*
Affiliation:
Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Seaver Center for Autism Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA; MINDICH Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA; Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Avner Schlessinger*
Affiliation:
Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
* Corresponding authors:
Abraham Reichenberg, E-mail: avi.reichenberg@mssm.edu
Avner Schlessinger, E-mail: avner.schlessinger@mssm.edu

Abstract

Background.

Current approaches for early identification of individuals at high risk for autism spectrum disorder (ASD) in the general population are limited, and most ASD patients are not identified until after the age of 4, despite substantial evidence that early diagnosis and intervention improve developmental course and outcome. The aim of the current study was to test the ability of machine learning (ML) models applied to electronic medical records (EMRs) to predict ASD early in life in a general population sample.

Methods.

We used EMR data from a single Israeli Health Maintenance Organization, including EMR information for the parents of 1,397 children with ASD (ICD-9/10) and 94,741 children without ASD, all born between January 1st, 1997 and December 31st, 2008. Routinely available parental sociodemographic information, parental medical histories, and prescribed medication data were used to generate features to train various ML algorithms, including multivariate logistic regression, artificial neural networks, and random forest. Prediction performance was evaluated with 10-fold cross-validation by computing the area under the receiver operating characteristic curve (AUC; C-statistic), sensitivity, specificity, accuracy, false positive rate, and precision (positive predictive value [PPV]).
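
The feature generation step can be pictured with a small sketch based on the description in the Figure 1 caption (parental age difference, number of unique medications, socioeconomic status, and proportion of drugs per level 2 ATC group). The record layout, the key names, the sign of the age difference, and the choice to compute proportions over prescriptions are all assumptions made for illustration; the ATC codes are illustrative inputs, not study data.

```python
from collections import Counter


def atc_level2(code: str) -> str:
    """Return the level 2 ATC group, i.e. the first three characters."""
    return code[:3].upper()


def parent_pair_features(mother: dict, father: dict) -> dict:
    """Build one feature row for a mother-father pair.

    The keys 'age', 'ses', and 'atc_codes' are a hypothetical record
    layout, not the schema used in the study.
    """
    features = {
        # Assumption: difference taken as father minus mother.
        "age_difference": father["age"] - mother["age"],
        # Assumption: one socioeconomic value per household.
        "ses": mother["ses"],
        "n_unique_medications": len(
            set(mother["atc_codes"]) | set(father["atc_codes"])
        ),
    }
    for role, record in (("mother", mother), ("father", father)):
        counts = Counter(atc_level2(c) for c in record["atc_codes"])
        total = sum(counts.values())
        # Proportion of this parent's drugs that fall in each level 2 group.
        for group, n in counts.items():
            features[f"{role}_{group}"] = n / total
    return features


mother = {"age": 30, "ses": 7,
          "atc_codes": ["N06AB03", "N06AB04", "A02BC01", "J01CA04"]}
father = {"age": 34, "ses": 7, "atc_codes": ["C10AA05", "J01CA04"]}
row = parent_pair_features(mother, father)
```

Groups a parent never received are simply absent from the row; a real pipeline would presumably fill them with zero so that every row has the same columns.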

Results.

All ML models tested performed similarly. Averaged across all models, the C-statistic was 0.709, sensitivity 29.93%, specificity 98.18%, accuracy 95.62%, false positive rate 1.81%, and PPV 43.35% for predicting ASD in this dataset.

Conclusions.

We conclude that ML algorithms combined with EMRs capture early-life ASD risk and reveal previously unknown features associated with ASD risk. Such approaches may enhance accurate and efficient early detection of ASD in large populations of children.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s) 2020

Figure 1. Workflow used to build the machine learning model of autism spectrum disorder (ASD) incidence. To evaluate the utility of electronic medical records (EMRs) and machine learning for predicting the risk of having a child with ASD, we developed a comprehensive dataset. (A) For each mother–father pair, the parental age difference, the number of unique medications either parent had taken, the socioeconomic status, and the proportion of drugs taken by the parent, grouped by level 2 Anatomical Therapeutic Chemical (ATC) classification code, were used for further analysis. (B) Workflow of the 10-fold cross-validation used to evaluate model performance. First, the data were partitioned into ASD and non-ASD cases, and 80% of the data were randomly sampled as the training set, with the remaining 20% withheld as the testing set. The training portions were then combined, and the synthetic minority oversampling technique (SMOTE) was used to generate synthetic records of ASD cases. A multilayer perceptron (MLP; also known as a feedforward neural network), logistic regression, and random forest models were trained on the oversampled training data. They were then evaluated on the testing data based on sensitivity, specificity, precision, false positive rate, and area under the ROC curve (AUC; C-statistic). Because the testing data contained no synthetic cases, model performance is indicative of performance on real data. This process was repeated 10 times and the average model performance was reported.
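
The evaluation loop in panel (B) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the toy data, the helper names `stratified_split` and `smote`, the random seeds, and the choice of k = 5 neighbours are assumptions, and `smote` is a simplified SMOTE-style interpolation rather than a full implementation of the published technique.

```python
import random


def stratified_split(X, y, test_fraction=0.2, rng=None):
    """Split each class separately so both sets keep the class ratio."""
    rng = rng or random.Random(0)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic rows, each interpolated between a random
    minority row and one of its k nearest minority neighbours.
    Requires at least two minority rows."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy data (hypothetical): 200 rows, 2 features, 10% positive class.
rng = random.Random(42)
y = [1 if i < 20 else 0 for i in range(200)]
X = [[rng.gauss(1.0 if label else 0.0, 1.0), rng.gauss(0.0, 1.0)] for label in y]

for repeat in range(10):
    train_idx, test_idx = stratified_split(X, y, 0.2, rng)
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    minority = [x for x, label in zip(X_train, y_train) if label == 1]
    # Oversample the training set only, until both classes are equal in size.
    n_new = len(y_train) - 2 * len(minority)
    X_bal = X_train + smote(minority, n_new, k=5, rng=rng)
    y_bal = y_train + [1] * n_new
    # A classifier would be fitted on (X_bal, y_bal) here and scored on the
    # untouched test rows X[i], y[i] for i in test_idx.
```

Oversampling after the split, and only on the training rows, is what keeps synthetic records out of the testing data, as the caption notes.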

Figure 2. Performance of machine learning-based autism spectrum disorder (ASD)-risk predictor. A balanced dataset was generated to train various algorithms to predict the probability of having a child with ASD from the parents' electronic medical records. (A) Receiver operating characteristic (ROC) curves for all methods tested: logistic regression, random forest, and MLP. (B) Boxplot of importance values of each feature in the random forest model after 10-fold cross-validation (10× CV). Importance of a feature is defined as the mean decrease in Gini coefficient when training a model. Level 2 Anatomical Therapeutic Chemical (ATC) classification codes are represented by a three-character alphanumeric code.
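
To make the importance measure in panel (B) concrete, the sketch below computes the decrease in Gini impurity produced by a single split, which is the quantity usually accumulated per feature in a random forest and then averaged over trees. The function names, the toy values, and the 0.5 threshold are assumptions for illustration; a real forest sums this quantity over every node that splits on the feature.

```python
def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(values, labels, threshold):
    """Impurity of the parent node minus the size-weighted impurity of the
    two children obtained by splitting on values <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) * gini_impurity(left)
                + len(right) * gini_impurity(right)) / n
    return gini_impurity(labels) - weighted


# Toy example (hypothetical values): one feature separates the classes,
# the other carries no information at this threshold.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
uninformative = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]

decrease_informative = gini_decrease(informative, labels, 0.5)
decrease_uninformative = gini_decrease(uninformative, labels, 0.5)
```

A feature that yields pure child nodes removes all of the parent's impurity, while a feature whose split leaves the class mix unchanged contributes nothing.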

Table 1. Performance of classifiers after 10-fold cross-validation
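
The metrics used to score the classifiers all follow from the confusion matrix, with the C-statistic computed from the ranking of predicted scores. The sketch below shows one way to compute them; the toy labels and scores and the 0.5 decision threshold are assumptions for illustration, not values from the study.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (1 = ASD, 0 = non-ASD).

    Assumes both classes occur and at least one positive prediction is
    made, so that no denominator is zero.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
        "false_positive_rate": fp / (fp + tn),
        "ppv": tp / (tp + fp),
    }


def c_statistic(y_true, scores):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions (hypothetical).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.2, 0.7, 0.3, 0.3, 0.2, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = c_statistic(y_true, scores)
```

By construction the false positive rate equals one minus the specificity, and PPV, unlike sensitivity and specificity, depends on how common ASD is in the evaluated sample.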

Figure 3. Sensitivity analysis of the generated machine learning models. (A) Receiver operating characteristic (ROC) curves for all methods tested with the “missing parental information” label included. (B) ROC curves of models generated when all parental medication data are removed. (C) ROC curves of models generated when all maternal medication data are removed.

Table 2. Performance of classifiers in sensitivity analysis

Supplementary material: File

Rahman et al. supplementary material
