
Good practices and common pitfalls of machine learning in nutrition research

Published online by Cambridge University Press:  06 December 2024

Daniel Kirk*
Affiliation:
Department of Twin Research & Genetic Epidemiology, King’s College London, London, UK
Corresponding author: Daniel Kirk; Email: daniel.1.kirk@kcl.ac.uk

Abstract

Machine learning is increasingly being utilised across various domains of nutrition research due to its ability to analyse complex data, especially as large datasets become more readily available. However, at times, this enthusiasm has led to the adoption of machine learning techniques prior to a proper understanding of how they should be applied, leading to non-robust study designs and results of questionable validity. To ensure that research standards do not suffer, key machine learning concepts must be understood by the research community. The aim of this review is to facilitate a better understanding of machine learning in research by outlining good practices and common pitfalls in each of the steps in the machine learning process. Key themes include the importance of generating high-quality data, employing robust validation techniques, quantifying the stability of results, accurately interpreting machine learning outputs, adequately describing methodologies, and ensuring transparency when reporting findings. Achieving this aim will facilitate the implementation of robust machine learning methodologies, which will reduce false findings and make research more reliable, as well as enable researchers to critically evaluate and better interpret the findings of others using machine learning in their work.

Information

Type
Conference on ‘New Data – Focused Approaches and Challenges’
Creative Commons
Creative Commons Licence: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of The Nutrition Society

Figure 1. The number of publications by year returned in PubMed with the search terms ‘machine learning nutrition’ from 2001 (the first year containing a publication with these search terms) to 2024. Date of search: 20th August 2024, 11:13 BST.


Table 1. A summary of key points and common pitfalls in each step in the machine learning procedure for research


Figure 2. A robust internal validation scheme using nested cross-validation. Data are first split using cross-validation (outer loop; step 1). In each fold of the outer loop, cross-validation is used on the training data (dark blue) for data processing, hyperparameter optimisation and feature selection. This is known as the inner loop (grey box; step 2). The performance of the model selected in the inner loop is then validated on the outer fold test data (dark orange; step 3). This process is depicted in the large, light blue box, and is repeated in each fold of the outer loop (step 4). The whole process is then repeated multiple times to account for instability of the results depending on how the dataset is split (step 5).
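The nested cross-validation scheme in Figure 2 can be sketched in code. The following is a minimal, illustrative implementation, not the author's own: it uses only the Python standard library, a toy one-parameter threshold classifier in place of a real model, and accuracy in place of AUC-ROC. The function and variable names (`nested_cv`, `k_fold_indices`, `thresholds`) are hypothetical choices for this sketch.

```python
import random
import statistics

def k_fold_indices(n, k, rng):
    """Shuffle indices and split them into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def accuracy(threshold, xs, ys):
    """Score a one-parameter threshold classifier: predict 1 if x >= threshold."""
    return sum((x >= threshold) == (y == 1) for x, y in zip(xs, ys)) / len(xs)

def nested_cv(X, y, thresholds, outer_k=5, inner_k=3, seed=0):
    rng = random.Random(seed)
    outer_scores = []
    # Step 1: the outer loop splits the data into train/test folds
    for test_idx in k_fold_indices(len(X), outer_k, rng):
        train_idx = [i for i in range(len(X)) if i not in set(test_idx)]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Step 2: the inner loop tunes the hyperparameter on training data only
        best_t, best_inner = thresholds[0], -1.0
        for t in thresholds:
            fold_scores = []
            for val_idx in k_fold_indices(len(X_tr), inner_k, rng):
                fold_scores.append(accuracy(t, [X_tr[i] for i in val_idx],
                                               [y_tr[i] for i in val_idx]))
            if statistics.mean(fold_scores) > best_inner:
                best_t, best_inner = t, statistics.mean(fold_scores)
        # Step 3: evaluate the selected model on the held-out outer fold,
        # repeated for each outer fold (step 4)
        outer_scores.append(accuracy(best_t, [X[i] for i in test_idx],
                                             [y[i] for i in test_idx]))
    return outer_scores

# Toy data: class 1 when x is above 0.5, with some label noise
rng = random.Random(42)
X = [rng.random() for _ in range(150)]
y = [1 if x + rng.gauss(0, 0.1) >= 0.5 else 0 for x in X]
scores = nested_cv(X, y, thresholds=[0.3, 0.4, 0.5, 0.6, 0.7])
print([round(s, 2) for s in scores])
```

Step 5 of the figure corresponds to calling `nested_cv` repeatedly with different `seed` values and summarising the resulting distributions of outer-fold scores.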


Figure 3. The effect of validation technique, the number of times it is repeated, and sample size on the stability and uncertainty of the results. One repeat of cross-validation (second row) is an immediate improvement over one repeat of train-test split (first row) because changes in the training and test data in each fold provide an indication of the stability of the AUC-ROC scores. Repeating the validation procedure multiple times with different subsamples also allows stability to be estimated, with this being more effective in cross-validation (fourth row) than train-test split (third row) because there are more test scores. Both instability and uncertainty tend to decrease as sample size increases.
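The repetition strategy described in Figure 3 can be illustrated with a short sketch. Again this is a toy example rather than the analysis behind the figure: it uses a fixed threshold classifier and accuracy instead of a fitted model and AUC-ROC, and all names (`one_cv_run`, `all_scores`) are hypothetical.

```python
import random
import statistics

def one_cv_run(X, y, k, rng):
    """One run of k-fold cross-validation with a fixed threshold classifier."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    scores = []
    for fold in [idx[i::k] for i in range(k)]:
        correct = sum((X[i] >= 0.5) == (y[i] == 1) for i in fold)
        scores.append(correct / len(fold))
    return scores

# Toy data: class 1 when x is above 0.5, with label noise
rng = random.Random(1)
X = [rng.random() for _ in range(200)]
y = [1 if x + rng.gauss(0, 0.15) >= 0.5 else 0 for x in X]

# Repeat 5-fold CV 20 times with different splits; the spread of the
# resulting 100 test scores quantifies the instability of the estimate.
all_scores = []
for _ in range(20):
    all_scores.extend(one_cv_run(X, y, k=5, rng=rng))

print(f"mean={statistics.mean(all_scores):.3f} sd={statistics.stdev(all_scores):.3f}")
```

A single train-test split would yield one score and no spread; repeating the split, or better, repeating cross-validation, produces a distribution whose standard deviation shrinks as the sample size grows, matching the pattern in the figure.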

Supplementary material: Kirk supplementary material (File, 1.9 KB)