Enhancing selection of alcohol consumption-associated genes by random forest

Published online by Cambridge University Press:  12 April 2024

Chenglin Lyu
Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA 02118, USA
Roby Joehanes
Affiliation:
Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Tianxiao Huan
Affiliation:
Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Daniel Levy
Affiliation:
Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Yi Li
Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Mengyao Wang
Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Xue Liu
Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Chunyu Liu*
Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Jiantao Ma*
Affiliation:
Nutrition Epidemiology and Data Science, Friedman School of Nutrition Science and Policy, Tufts University, Boston, MA 02111, USA
*Corresponding authors: Chunyu Liu, email: liuc@bu.edu; Jiantao Ma, email: jiantao.ma@tufts.edu

Abstract

Machine learning methods have been used to identify omics markers for a variety of phenotypes. We aimed to examine whether a supervised machine learning algorithm can improve identification of alcohol-associated transcriptomic markers. In this study, we analysed array-based, whole-blood-derived expression data for 17 873 gene transcripts in 5508 Framingham Heart Study participants. Using the Boruta algorithm, a supervised random forest (RF)-based feature selection method, we selected twenty-five alcohol-associated transcripts. In a testing set (30 % of all study participants), the AUC (area under the receiver operating characteristics curve) values of these twenty-five transcripts were 0·73, 0·69 and 0·66 for non-drinkers v. moderate drinkers, non-drinkers v. heavy drinkers and moderate drinkers v. heavy drinkers, respectively. The AUC values of the Boruta method-selected transcripts were comparable to those of transcripts identified using conventional linear regression models; for example, the AUC values of 1958 transcripts identified by conventional linear regression models (false discovery rate < 0·2) were 0·74, 0·66 and 0·65, respectively. With Bonferroni correction for the twenty-five Boruta method-selected transcripts and three CVD risk factors (i.e. at P < 6·7e-4), we observed that thirteen transcripts were associated with obesity, three transcripts with type 2 diabetes and one transcript with hypertension. For example, alcohol consumption was inversely associated with the expression of DOCK4, IL4R and SORT1; DOCK4 and SORT1 were positively associated with obesity, and IL4R was inversely associated with hypertension. In conclusion, using a supervised machine learning method, the RF-based Boruta algorithm, we identified novel alcohol-associated gene transcripts.
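
The two quantities reported in the abstract, the Bonferroni threshold and the AUC, can be illustrated with a minimal sketch. It assumes a family-wise significance level of 0·05, which the abstract does not state but which is consistent with the reported cut-off of P < 6·7e-4 for 25 × 3 tests. The function names and the toy scores are hypothetical and are not taken from the study; the AUC here is computed by direct pairwise comparison, which may differ from the software the authors used.

```python
from itertools import product


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def pairwise_auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as one half)."""
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# 25 selected transcripts x 3 CVD risk factors = 75 tests.
# alpha = 0.05 is an assumption, not stated in the abstract.
threshold = bonferroni_threshold(0.05, 25 * 3)
print(f"{threshold:.1e}")  # prints 6.7e-04

# Hypothetical toy prediction scores, for illustration only.
toy_auc = pairwise_auc([0.9, 0.7, 0.4], [0.6, 0.3, 0.2])
print(round(toy_auc, 3))
```

An AUC of 0·5 corresponds to chance-level discrimination and 1·0 to perfect separation, which gives context for the reported values of 0·66 to 0·74.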

Information

Type
Research Article
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of The Nutrition Society
Fig. 1. Study flow chart. FDR, false discovery rate; FHS, Framingham Heart Study; MSigDB, Molecular Signatures Database.

Table 1. Participant characteristics

Table 2. Boruta algorithm-selected genes

Fig. 2. ROC of selected predictors. (1) Boruta method was based on the twenty-five Boruta method-selected transcripts; (2) 1958 transcripts and (3) twenty-five transcripts were from alcohol-gene expression analyses using conventional linear regression (see ref. 9); (4) 144 CpG were from meta-analysis of alcohol-associated DNA methylation markers (see ref. 21); (5) combined predictors from sets 1, 3 and 4. ROC, receiver operating characteristics.

Table 3. Cross-sectional analysis of Boruta method-selected genes with CVD risk factors

Supplementary material: File

Lyu et al. supplementary material
