A comparison of statistical and machine-learning techniques in evaluating the association between dietary patterns and 10-year cardiometabolic risk (2002–2012): the ATTICA study

Published online by Cambridge University Press:  23 May 2018

Dimitris Panaretos
Affiliation:
Department of Nutrition and Dietetics, School of Health Science and Education, Harokopio University, 17676 Athens, Greece
Efi Koloverou
Affiliation:
Department of Nutrition and Dietetics, School of Health Science and Education, Harokopio University, 17676 Athens, Greece
Alexandros C. Dimopoulos
Affiliation:
Department of Informatics & Telematics, School of Digital Technology, Harokopio University, 17676 Athens, Greece
Georgia-Maria Kouli
Affiliation:
Department of Nutrition and Dietetics, School of Health Science and Education, Harokopio University, 17676 Athens, Greece
Malvina Vamvakari
Affiliation:
Department of Informatics & Telematics, School of Digital Technology, Harokopio University, 17676 Athens, Greece
George Tzavelas
Affiliation:
Department of Statistics and Insurance Science, University of Piraeus, 18534 Piraeus, Greece
Christos Pitsavos
Affiliation:
School of Medicine, University of Athens, 11527 Athens, Greece
Demosthenes B. Panagiotakos*
Affiliation:
Department of Nutrition and Dietetics, School of Health Science and Education, Harokopio University, 17676 Athens, Greece
*Corresponding author: D. B. Panagiotakos, email d.b.panagiotakos@usa.net

Abstract

Statistical methods are usually applied in examining diet–disease associations, whereas factor analysis is commonly used for dietary pattern recognition. Recently, machine learning (ML) has also been proposed as an alternative technique for health classification. In this work, the predictive accuracy of statistical v. ML methodologies in evaluating the association of dietary patterns with CVD risk was tested. During 2001–2002, 3042 men and women (mean age 45 (sd 14) years) were enrolled in the ATTICA study. In 2011–2012, the 10-year CVD follow-up was performed among 2020 participants. Item Response Theory was applied to create a metric of combined 10-year cardiometabolic risk, the ‘Cardiometabolic Health Score’, which incorporated the incidence of CVD, diabetes, hypertension and hypercholesterolaemia. Factor analysis was performed to extract dietary patterns on the basis of either foods or nutrients consumed; linear regression analysis was used to assess their association with the cardiometabolic score. Two ML techniques (the k-nearest-neighbours algorithm and random-forest decision trees) were applied to evaluate participants’ health based on dietary information. Factor analysis revealed five factors from foods and three from nutrients, explaining 54 and 65 % of the total variation in intake, respectively. Nutrient and food pattern regression models showed similar accuracy in correctly classifying an individual according to cardiometabolic risk (R²=9·6 % and R²=8·3 %, respectively). ML techniques were superior to linear regression in correctly classifying individuals according to the Health Score (accuracy approximately 38 v. 6 %, respectively), whereas the two ML methods showed equal classification ability. In conclusion, ML methods could be a valuable tool in the field of nutritional epidemiology, leading to more accurate disease-risk evaluation.
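As an illustration of the classification step described in the abstract, the following is a minimal sketch of a k-nearest-neighbours classifier that assigns a participant to a Health Score class from dietary factor scores and then reports accuracy. It is not the authors' implementation. The use of tertiles as classes, the two-factor toy data, the choice of k = 3 and the function names (`tertile_labels`, `knn_predict`, `accuracy`) are all assumptions made for the example.

```python
import math
from collections import Counter


def tertile_labels(scores):
    """Assign each score to tertile 0, 1 or 2 by rank (assumed class definition)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = rank * 3 // len(scores)
    return labels


def knn_predict(train_x, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest neighbours."""
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


def accuracy(true_labels, predicted_labels):
    """Proportion of correctly classified individuals."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Hypothetical data: two dietary factor scores per participant and a health score.
train_x = [(-1.0, -0.8), (-0.9, -1.1), (0.0, 0.1), (0.1, -0.1), (1.0, 0.9), (1.2, 1.0)]
health_scores = [2.0, 3.0, 5.0, 5.5, 8.0, 9.0]
train_y = tertile_labels(health_scores)

# Hypothetical held-out participants with known tertiles.
test_x = [(-1.1, -0.9), (0.05, 0.0), (1.1, 1.1)]
test_y = [0, 1, 2]

predictions = [knn_predict(train_x, train_y, q, k=3) for q in test_x]
print("accuracy:", accuracy(test_y, predictions))
```

The toy data are cleanly separated, so the sketch classifies every held-out point correctly; real dietary data would not behave this way, as the roughly 38 % accuracy reported in the abstract indicates.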

Type: Full Papers
Copyright: © The Authors 2018
Table 1 Baseline characteristics and 10-year incidence of CVD, hypertension, diabetes mellitus and hypercholesterolaemia of the ATTICA study’s participants according to the Cardiometabolic Health Score tertiles* (Numbers and percentages; mean values and standard deviations)

Table 2 Factor loadings of foods and nutrients consumed by the ATTICA study participants (n 2020) as derived from the factor analysis using the principal component method

Table 3 Results from multiple linear regression models that evaluated the associations between food and nutrient factors and the 10-year Cardiometabolic Health Score (the higher the score the better the health status), among the 2020 ATTICA study participants

Fig. 1 Accuracy of the two classifiers (k-NN and random forest (RF)) when the models were constructed from (a) the food factors and (b) the nutrient factors. ML, machine learning.

Fig. 2 A typical tree created by the random forest (RF) method for the model using the food factors. RF creates an ensemble of trees, each of which casts one vote; the model then classifies each sample by majority rule.
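The majority rule mentioned in the caption can be sketched as follows. This is a simplified illustration, not the authors' model: the three "trees" are hypothetical one-split rules written by hand, the factor names and thresholds are invented, and the class labels `"upper"` and `"lower"` are placeholders.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class that received the most votes across the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical trees: each maps a sample's factor scores to a class label.
def tree_a(sample):
    return "upper" if sample["factor_1"] > 0 else "lower"


def tree_b(sample):
    return "upper" if sample["factor_2"] > 0 else "lower"


def tree_c(sample):
    return "upper" if sample["factor_1"] + sample["factor_2"] > 0 else "lower"


def forest_predict(trees, sample):
    """Collect one vote per tree and decide by majority rule."""
    return majority_vote([tree(sample) for tree in trees])


trees = [tree_a, tree_b, tree_c]
sample = {"factor_1": 0.5, "factor_2": -0.3}  # hypothetical factor scores
print(forest_predict(trees, sample))
```

Using an odd number of trees, as here, avoids tied votes in a two-class problem; with three or more classes a tie-breaking rule would still be needed.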