
Machine learning identifies differences between breast milk and formula in the gut microbiome

Published online by Cambridge University Press:  08 May 2026

Ting Chia Liu
Affiliation:
Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht, The Netherlands
David Rojas-Velazquez*
Affiliation:
Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht, The Netherlands; Department of Data Science, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
Sarah Kidwai
Affiliation:
Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht, The Netherlands
Astrid Hogenkamp
Affiliation:
Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht, The Netherlands
Johan Garssen
Affiliation:
Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht, The Netherlands; Danone Global Research & Innovation Center, Utrecht, The Netherlands
Aletta D. Kraneveld
Affiliation:
Department of Neuroscience, Faculty of Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Alejandro Lopez-Rincon
Affiliation:
Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht, The Netherlands
*Corresponding author: David Rojas-Velazquez; Email: e.d.rojasvelazquez@uu.nl

Abstract

In this study, we analysed differences in the gut microbiome between breastfed and formula-fed infants using novel machine learning techniques. Breast milk, rich in bioactive agents, supports microbiota composition and immune development, while formulas aim to replicate its nutritional profile. We applied a methodology combining the DADA2 pipeline for processing 16S rRNA sequencing data with the Recursive Ensemble Feature Selection (REFS) algorithm for biomarker discovery. We analysed three publicly available 16S rRNA datasets: PRJNA633365 (70 stool samples from China), PRJDB7295 (40 stool samples from the Philippines), and PRJNA562650 (40 stool samples from China). Analysis of the discovery dataset (PRJNA633365) identified 16 significant taxa out of 1,227, and these were validated on the other two datasets. We then compared the performance of REFS with that of another feature selection algorithm, SelectKBest. We also conducted a literature review to explore links between the identified taxa and medical conditions, and used MicrobiomeAnalyst to examine associations with diseases, diet, and lifestyle. Our results show differences in bacterial composition between breastfed and formula-fed infants that hold across two independent datasets. Future research should explore the functional roles of these taxa and consider regional and dietary variability to enhance understanding of microbiome dynamics and long-term health outcomes.
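
To illustrate the kind of univariate filter that SelectKBest represents, the following is a minimal sketch and not the authors' pipeline: it scores each taxon with a one-way ANOVA F statistic and keeps the k highest-scoring taxa. The taxon names, the abundance values, and the choice of the F statistic as scoring function are assumptions made purely for illustration. The function names are hypothetical, and the sketch uses only the Python standard library rather than scikit-learn.

```python
from statistics import mean


def f_score(values, labels):
    """One-way ANOVA F statistic for a single feature across label groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n_samples = len(values)
    n_groups = len(groups)
    grand_mean = mean(values)
    # Between-group variance: how far each group mean lies from the grand mean.
    between = sum(
        len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values()
    ) / (n_groups - 1)
    # Within-group variance: spread of samples around their own group mean.
    within = sum(
        (v - mean(g)) ** 2 for g in groups.values() for v in g
    ) / (n_samples - n_groups)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def select_k_best(table, labels, k):
    """Return the k feature names with the highest F score, plus all scores.

    table maps a feature name to its list of per-sample values.
    """
    scores = {name: f_score(vals, labels) for name, vals in table.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Invented toy data (assumption): relative abundances for six samples.
labels = ["breast", "breast", "breast", "formula", "formula", "formula"]
abundances = {
    "taxon_A": [0.50, 0.55, 0.60, 0.10, 0.12, 0.08],
    "taxon_B": [0.20, 0.22, 0.18, 0.21, 0.19, 0.20],
    "taxon_C": [0.05, 0.30, 0.10, 0.25, 0.02, 0.15],
    "taxon_D": [0.01, 0.02, 0.03, 0.30, 0.35, 0.40],
}

selected, scores = select_k_best(abundances, labels, k=2)
print(selected)
```

A filter of this kind ranks each feature independently of the others, which is the main conceptual contrast with an ensemble-based, recursive approach such as REFS.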

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press in association with The Nutrition Society

Figure 1. Overview of the methodology followed in this work. The upper section shows, from left to right, the dataset selection criteria, raw data processing, and feature selection phases, including the analysis conducted using MicrobiomeAnalyst and the SelectKBest experiment. The lower section corresponds to the testing phase.

Table 1. Summary of dataset characteristics: NCBI accession numbers, targeted 16S rRNA gene regions, approximate raw read lengths, final ASV lengths resulting from DADA2 processing, trimming/truncation parameters, and the total number of ASVs generated for each dataset.

Figure 2. (A) Feature selection phase: the reduced set of features that achieves the highest accuracy is identified by the red line; (B) AUC-ROC for the classifier with the best performance in the validation module applied to the discovery dataset PRJNA633365.

Figure 3. Taxonomy distribution at the genus level. The numbers represent the number of elements contained in each genus-level bacterial taxon.

Figure 4. AUC-ROC plots corresponding to the classifier with the best performance for the two testing datasets: (A) AdaBoost for PRJDB7295, (B) MLP for PRJNA562650.

Figure 5. Heatmap showing the relative abundance (increase or decrease) of the selected taxa in breast-milk samples compared to formula samples. Genera are displayed from left to right following the top-to-bottom order presented in Supplementary File 1.

Supplementary material: File

Chia Liu et al. supplementary material
