
Enhancing selection of alcohol consumption-associated genes by random forest

Published online by Cambridge University Press:  12 April 2024

Chenglin Lyu
Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA; Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA 02118, USA
Roby Joehanes
Affiliation:
Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Tianxiao Huan
Affiliation:
Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Daniel Levy
Affiliation:
Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Yi Li
Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Mengyao Wang
Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Xue Liu
Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Chunyu Liu*
Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Jiantao Ma*
Affiliation:
Nutrition Epidemiology and Data Science, Friedman School of Nutrition Science and Policy, Tufts University, Boston, MA 02111, USA
*Corresponding authors: Chunyu Liu, email: liuc@bu.edu; Jiantao Ma, email: jiantao.ma@tufts.edu

Abstract

Machine learning methods have been used to identify omics markers for a variety of phenotypes. We aimed to examine whether a supervised machine learning algorithm can improve the identification of alcohol-associated transcriptomic markers. In this study, we analysed array-based, whole-blood-derived expression data for 17 873 gene transcripts in 5508 Framingham Heart Study participants. Using the Boruta algorithm, a supervised random forest (RF)-based feature selection method, we selected twenty-five alcohol-associated transcripts. In a testing set (30 % of all study participants), the AUC (area under the receiver operating characteristic curve) values of these twenty-five transcripts were 0·73, 0·69 and 0·66 for non-drinkers v. moderate drinkers, non-drinkers v. heavy drinkers and moderate drinkers v. heavy drinkers, respectively. The AUC values of the transcripts selected by the Boruta method were comparable to those of transcripts identified using conventional linear regression models; for example, the AUC values of 1958 transcripts identified by conventional linear regression models (false discovery rate < 0·2) were 0·74, 0·66 and 0·65, respectively. With Bonferroni correction for the twenty-five Boruta-selected transcripts and three CVD risk factors (i.e. at P < 6·7e-4), we observed that thirteen transcripts were associated with obesity, three with type 2 diabetes and one with hypertension. For example, alcohol consumption was inversely associated with the expression of DOCK4, IL4R and SORT1; DOCK4 and SORT1 were positively associated with obesity, and IL4R was inversely associated with hypertension. In conclusion, using a supervised machine learning method, the RF-based Boruta algorithm, we identified novel alcohol-associated gene transcripts.
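
The two quantities the abstract reports, the Bonferroni threshold and the pairwise AUC, can be illustrated with a minimal Python sketch. It is not the authors' code. The family-wise level of 0·05 is an assumption (the abstract states only the resulting threshold), the function names are hypothetical, and the toy scores are invented for illustration and are not study data.

```python
from itertools import product


def bonferroni_threshold(n_transcripts, n_traits, alpha=0.05):
    """Per-test P-value threshold: alpha divided by the number of tests.

    alpha=0.05 is an assumed family-wise level, not stated in the abstract.
    """
    return alpha / (n_transcripts * n_traits)


def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A pair counts 1 when the positive score is higher and 0.5 on a tie.
    """
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Twenty-five transcripts tested against three CVD risk factors: 75 tests.
threshold = bonferroni_threshold(25, 3)
print(f"Bonferroni threshold: {threshold:.1e}")

# Invented predicted scores for one pairwise comparison
# (for example, moderate drinkers as positives, non-drinkers as negatives).
toy_pos = [0.9, 0.7, 0.6, 0.4]
toy_neg = [0.8, 0.5, 0.3, 0.2]
toy_auc = auc(toy_pos, toy_neg)
print(f"Toy AUC: {toy_auc:.2f}")
```

Under the assumed level of 0·05, dividing by 75 tests gives approximately 6·7e-4, which matches the threshold quoted in the abstract. The pairwise definition of AUC is used here because it needs no external library; for a dataset of the study's size a rank-based routine would be preferable.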

Type
Research Article
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of The Nutrition Society


Footnotes

These authors contributed equally to this work

Supplementary material: File

Lyu et al. supplementary material