A comparison of heuristic and model-based clustering methods for dietary pattern analysis

Published online by Cambridge University Press: 20 January 2015

Benjamin Greve
Affiliation:
Leibniz-Institute for Prevention Research and Epidemiology – BIPS GmbH, Achterstrasse 30, 28359 Bremen, Germany; Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany
Iris Pigeot
Affiliation:
Leibniz-Institute for Prevention Research and Epidemiology – BIPS GmbH, Achterstrasse 30, 28359 Bremen, Germany; Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany
Inge Huybrechts
Affiliation:
Department of Public Health, Ghent University, Ghent, Belgium; Dietary Exposure Assessment Group (DEX), International Agency for Research on Cancer, Lyon, France
Valeria Pala
Affiliation:
Department of Preventive & Predictive Medicine, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
Claudia Börnhorst*
Affiliation:
Leibniz-Institute for Prevention Research and Epidemiology – BIPS GmbH, Achterstrasse 30, 28359 Bremen, Germany
*Corresponding author: Email boern@bips.uni-bremen.de

Abstract

Objective

Cluster analysis is widely applied to identify dietary patterns. A new method based on Gaussian mixture models (GMM) appears to be more flexible than the commonly applied k-means algorithm and Ward’s method. In the present paper, these three clustering approaches are compared to find the most appropriate one for clustering dietary data.

Design

The clustering methods were applied to simulated data sets with different cluster structures, so that their performance could be compared against the known true cluster membership of the observations. Furthermore, the three methods were applied to FFQ data collected from 1791 children participating in the IDEFICS (Identification and Prevention of Dietary- and Lifestyle-Induced Health Effects in Children and Infants) Study to explore their performance in practice.
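The table captions on this page state that agreement with the true cluster structure is assessed by the adjusted Rand index (ARI). As a minimal sketch of how such an index can be computed from two label vectors, the following pure-Python function uses the standard pair-counting definition. The function name and the toy label vectors are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same observations.

    Hypothetical helper for illustration; it follows the usual
    pair-counting formula and is not the authors' code.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts: joint cells, row totals, column totals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of observation pairs that fall together in each grouping.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy example: the same partition with swapped label names.
truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]
print(adjusted_rand_index(truth, relabelled))
```

Because the index depends only on which observations are grouped together, renaming the clusters does not change it, which is why it suits comparing a clustering solution with simulated true memberships.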

Results

The GMM outperformed the other methods in the simulation study in 72–100 % of cases, depending on the simulated cluster structure. Comparing the computationally less complex k-means and Ward’s methods, k-means performed better in 64–100 % of cases. Applied to real data, all methods identified three similar dietary patterns, which may be roughly characterized as a ‘non-processed’ cluster with a high consumption of fruits, vegetables and wholemeal bread, a ‘balanced’ cluster with only slight preferences for single foods and a ‘junk food’ cluster.

Conclusions

The simulation study suggests that clustering via GMM should be preferred because of its higher flexibility regarding cluster volume, shape and orientation. The k-means algorithm appears to be a good alternative, being easier to use while giving similar results when applied to real data.

Information

Type: Research Papers
Copyright: © The Authors 2015
Fig. 1 A simulated two-dimensional data set with three clusters (represented by ●, □ and +) of variable volumes and shapes (‘true clusters’, a) as well as the clustering solutions obtained with the k-means algorithm (b), Ward’s method (c) and a Gaussian mixture model (GMM, d)

Fig. 2 Exemplary two-dimensional data set for each of the three cluster geometries: (a) spherical, equal volume; (b) variable volume, shape and orientation; and (c) cube-shaped, equal volume and orientation (●, □, ▲ and + represent different clusters)

Table 1 Comparison of pairs of clustering methods by how often each one achieved a higher agreement with the true cluster structure, based on 10 000 simulated data sets for each cluster geometry

Table 2 Comparison of the performance of the three clustering methods on 10 000 simulated data sets for each cluster geometry based on the ARI

Table 3 Pairwise agreement between the clustering solutions obtained with the GMM, the k-means algorithm and Ward’s method assessed by the ARI

Fig. 3 Clustering solutions with three clusters obtained with (a) the Gaussian mixture model (GMM), (b) the k-means algorithm and (c) Ward’s method, based on the IDEFICS CEHQ-FFQ data (1791 children). For each food item, the lengths of the corresponding bars represent the difference between the cluster-specific mean consumption frequencies and the overall mean consumption frequencies in the sample, measured in units of overall standard deviations for the single food items. The number of observations and the percentage of overweight and obese(25) (OW/OB) children are indicated for each cluster. IDEFICS, Identification and Prevention of Dietary- and Lifestyle-Induced Health Effects in Children and Infants; CEHQ, Children’s Eating Habits Questionnaire

Fig. 4 Scatter plot of a two-dimensional projection of a sub-sample of the pre-processed IDEFICS CEHQ-FFQ data as an example of zero inflation in FFQ data. IDEFICS, Identification and Prevention of Dietary- and Lifestyle-Induced Health Effects in Children and Infants; CEHQ, Children’s Eating Habits Questionnaire
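
The Fig. 3 caption describes each bar as the difference between a cluster-specific mean consumption frequency and the overall mean, expressed in units of the overall standard deviation of that food item. A minimal sketch of that calculation is given below. The function name, the toy data and the choice of the sample standard deviation are assumptions made for illustration; the caption does not say which standard deviation estimator was used.

```python
from statistics import mean, stdev


def cluster_profiles(rows, labels):
    """Standardized cluster-mean deviations per food item.

    rows   : list of equal-length lists, one per child, holding
             consumption frequencies for each food item
    labels : cluster label of each row
    Returns {label: [deviation for item 0, item 1, ...]}.
    Hypothetical helper, not the authors' implementation.
    """
    columns = list(zip(*rows))
    overall_mean = [mean(col) for col in columns]
    # Assumption: sample standard deviation (n - 1 denominator).
    overall_sd = [stdev(col) for col in columns]

    profiles = {}
    for label in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        profile = []
        for j in range(len(columns)):
            cluster_mean = mean(row[j] for row in members)
            if overall_sd[j] == 0:
                profile.append(0.0)  # item with no variation in the sample
            else:
                profile.append((cluster_mean - overall_mean[j]) / overall_sd[j])
        profiles[label] = profile
    return profiles


# Toy data: four children, two food items, two clusters (made-up numbers).
toy_rows = [[1, 4], [3, 4], [5, 8], [7, 8]]
toy_labels = ["a", "a", "b", "b"]
for label, profile in cluster_profiles(toy_rows, toy_labels).items():
    print(label, [round(value, 3) for value in profile])
```

A positive value means the cluster consumes that item more often than the sample average, a negative value less often, which matches the direction of the bars as the caption describes them.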