
Multivariate techniques and their application in nutrition: a metabolomics case study

Published online by Cambridge University Press:  01 July 2007

E. Katherine Kemsley
Affiliation:
Institute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK
Gwénaëlle Le Gall
Affiliation:
Institute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK
Jack R. Dainty*
Affiliation:
Institute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK
Andrew D. Watson
Affiliation:
Institute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK
Linda J. Harvey
Affiliation:
Institute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK
Henri S. Tapp
Affiliation:
Institute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK
Ian J. Colquhoun
Affiliation:
Institute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK
*Corresponding author: Jack R. Dainty, fax 01603 507723, email jack.dainty@bbsrc.ac.uk

Abstract

The post-genomic technologies are generating vast quantities of data but many nutritional scientists are not trained or equipped to analyse it. In high-resolution NMR spectra of urine, for example, the number and complexity of spectral features mean that computational techniques are required to interrogate and display the data in a manner intelligible to the researcher. In addition, there are often multiple underlying biological factors influencing the data and it is difficult to pinpoint which have the most significant effect. This is especially true in nutritional studies, where small variations in diet can trigger multiple changes in gene expression and metabolite concentration. One class of computational tools useful for analysing such highly multivariate data comprises the well-known ‘whole spectrum’ methods of principal component analysis and partial least squares. In this work, we present a nutritional case study in which NMR data generated from a human dietary Cu intervention study is analysed using multivariate methods, and the advantages and disadvantages of each technique are discussed. It is concluded that an alternative approach, called feature subset selection, will be important in this type of work; here we have used a genetic algorithm to identify the small peaks (arising from metabolites of low concentration) that have been altered significantly following a dietary intervention.

Information

Type
Horizons in Nutritional Science
Copyright
Copyright © The Authors 2007

Fig. 1 (a) A set of high-resolution NMR spectra collected from twenty-four urine samples from a single volunteer; (b) an expansion of a region of the data shown in Fig. 1(a). In both panels, spectra are shown offset for clarity.

Fig. 2 (a) A set of simulated ‘spectra’ consisting of only three variables (at three chemical shift values); (b) illustration of the concept of rigid data rotation, as implemented in principal component analysis and partial least squares data compressions.
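
The idea of a rigid rotation can be sketched numerically. The example below is a minimal illustration, assuming simulated three-variable data (the values, the random seed and the helper `pairwise_distances` are hypothetical and not taken from the study). It computes principal component scores from a singular value decomposition and checks that the transformation is orthogonal, so that distances between samples are preserved. Strictly, an orthogonal transformation may include a reflection as well as a rotation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 24 "spectra" measured at three chemical shifts, with
# correlated intensities (simulated values, not from the study).
latent = rng.normal(size=(24, 1))
X = latent @ np.array([[1.0, 0.8, 0.3]]) + 0.1 * rng.normal(size=(24, 3))

# Mean-centre, then take the SVD; the rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# PC scores are the centred data expressed in the rotated coordinate system.
scores = Xc @ Vt.T

def pairwise_distances(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# The axes are orthonormal, so the transformation is rigid: distances
# between samples are the same before and after.
is_orthogonal = np.allclose(Vt @ Vt.T, np.eye(3))
distances_kept = np.allclose(pairwise_distances(Xc), pairwise_distances(scores))

# Fraction of total variance carried by each principal component.
explained = s ** 2 / (s ** 2).sum()
```

Because the rotation only re-expresses the data, no information is lost until the later (low-variance) score columns are discarded.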

Fig. 3 (a) Principal component (PC) scores of the raw NMR dataset; (b) partial least squares (PLS) scores of the raw NMR data; (c) PLS scores of ‘noise’ with the same dimensions as the NMR data; (d) cross-validated PLS scores of the NMR data. For plots (a), (b) and (d), ○, ‘pre-intervention’ samples; ■, ‘post-intervention’ samples.
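
The contrast between panels (c) and (d) can be reproduced in outline. The sketch below is illustrative only: the matrix size (24 × 5000), the group labels and the function names are assumptions, and only the first PLS component is computed, using the standard result that its weight vector is proportional to the covariance between the centred variables and the centred response. Pure noise appears to separate two arbitrary groups when the model is fitted and applied to the same samples, but not when each sample is scored by a model built without it.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_vars = 24, 5000                  # hypothetical dimensions
X = rng.normal(size=(n_samples, n_vars))      # pure noise, no group structure
y = np.repeat([-1.0, 1.0], n_samples // 2)    # arbitrary group labels

def first_pls_weight(X_train, y_train):
    """Unit weight vector of the first PLS component, plus the training mean."""
    mu = X_train.mean(axis=0)
    w = (X_train - mu).T @ (y_train - y_train.mean())
    return w / np.linalg.norm(w), mu

# Scores when the model is fitted and applied to the same samples.
w, mu = first_pls_weight(X, y)
t_fit = (X - mu) @ w

# Leave-one-out cross-validated scores: sample i plays no part in its own model.
t_cv = np.empty(n_samples)
for i in range(n_samples):
    keep = np.arange(n_samples) != i
    w_i, mu_i = first_pls_weight(X[keep], y[keep])
    t_cv[i] = (X[i] - mu_i) @ w_i

def group_gap(t):
    """Difference between the mean scores of the two groups."""
    return t[y > 0].mean() - t[y < 0].mean()

gap_fit, gap_cv = group_gap(t_fit), group_gap(t_cv)
```

With many more variables than samples, `gap_fit` is large even though the data contain no real effect, whereas `gap_cv` stays close to zero. This is the reason for inspecting cross-validated scores before interpreting any apparent group separation.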

Fig. 4 Illustrating data pre-processing. (a) ‘Standardized’ spectra (the same data as shown in Fig. 1(a), but here mean-centred and scaled to unit variance); (b) threshold filtering: the points represent the variables retained.
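
The two pre-processing steps named in this caption might look as follows in code. This is a sketch under stated assumptions: the simulated matrix, the threshold value and the filtering criterion (retain variables whose mean intensity exceeds a threshold) are hypothetical, since the caption does not specify how the threshold is applied.

```python
import numpy as np

def standardize(X):
    """Mean-centre each variable (column) and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std = np.where(std == 0, 1.0, std)   # constant columns are left at zero
    return (X - mean) / std

def threshold_filter(X, threshold):
    """Boolean mask of variables whose mean intensity exceeds `threshold`.

    The criterion is an assumption made for illustration.
    """
    return X.mean(axis=0) > threshold

# Simulated stand-in for a spectral matrix: 24 samples x 200 data points.
rng = np.random.default_rng(2)
X = np.abs(rng.normal(scale=0.01, size=(24, 200)))   # baseline noise
X[:, 50] += rng.normal(10.0, 1.0, size=24)           # a large peak
X[:, 120] += rng.normal(0.5, 0.1, size=24)           # a small peak

keep = threshold_filter(X, threshold=0.1)   # drop noise-only variables
Z = standardize(X[:, keep])                 # then put retained peaks on one scale
retained = np.flatnonzero(keep).tolist()
```

Filtering before standardization matters in this sketch: scaling every variable to unit variance would otherwise inflate baseline noise to the same weight as genuine peaks.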

Fig. 5 Cross-validated partial least squares (PLS) scores shown after various combinations of pre-treatment. (a) Filtering only; (b) filtering and standardization; (c) ‘area-normalization’ and filtering; (d) area-normalization, filtering and standardization. ○, ‘pre-intervention’ samples; ■, ‘post-intervention’ samples.
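
For reference, area-normalization is commonly implemented by dividing each spectrum by its total integrated intensity. The sketch below assumes evenly spaced data points, so that a simple sum stands in for the area, and spectra with positive total intensity; the function name and the toy values are hypothetical.

```python
import numpy as np

def area_normalize(X):
    """Scale each spectrum (row) so that its summed intensity equals 1."""
    area = X.sum(axis=1, keepdims=True)
    return X / area

# Two toy spectra with the same shape, one twice as intense as the other.
shape = np.array([1.0, 4.0, 2.0, 1.0])
X = np.vstack([shape, 2.0 * shape])
Xn = area_normalize(X)
```

After normalization the two rows of `Xn` coincide, which shows how the step removes overall intensity differences between samples while keeping relative peak heights.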

Fig. 6 Simple schematic showing the main steps involved in a genetic algorithm (GA) for feature selection in NMR data.
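
The generic steps of a genetic algorithm for feature selection (initialise a population of candidate subsets, score them, select parents, recombine, mutate, repeat) can be sketched as below. This is not the authors' implementation: the simulated data, the fitness criterion (a sum of squared t-like statistics), the population size, the mutation rate and all function names are assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated stand-in data (not from the study): 24 samples, 300 variables,
# two groups; only variables 10, 20 and 30 differ between the groups.
n, p, k = 24, 300, 3
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[np.ix_(y == 1, [10, 20, 30])] += 3.0

def fitness(subset):
    """Sum of squared t-like statistics over the chosen variables.

    Hypothetical criterion; a cross-validated classifier could be used instead.
    """
    a, b = X[y == 0][:, subset], X[y == 1][:, subset]
    pooled = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2)
    return float((((a.mean(axis=0) - b.mean(axis=0)) / pooled) ** 2).sum())

def crossover(parent_a, parent_b):
    """Child draws k distinct variables from the union of both parents."""
    pool = np.union1d(parent_a, parent_b)
    return rng.choice(pool, size=k, replace=False)

def mutate(subset, rate=0.5):
    """With probability `rate`, swap one variable for a random unused one."""
    subset = subset.copy()
    if rng.random() < rate:
        unused = np.setdiff1d(np.arange(p), subset)
        subset[rng.integers(k)] = rng.choice(unused)
    return subset

# Initial population: random subsets of k distinct variables.
population = [rng.choice(p, size=k, replace=False) for _ in range(60)]
initial_best = max(fitness(s) for s in population)

for generation in range(100):
    # Rank by fitness and keep the better half unchanged (elitist selection).
    population.sort(key=fitness, reverse=True)
    parents = population[:30]
    children = []
    for _ in range(30):
        i, j = rng.choice(30, size=2, replace=False)
        children.append(mutate(crossover(parents[i], parents[j])))
    population = parents + children

best = max(population, key=fitness)
best_subset = sorted(best.tolist())
final_best = fitness(best)
```

Because the better half of each generation is carried over unchanged, the best fitness in the population can never decrease. A single run may still settle on a sub-optimal subset, which is one motivation for repeating the algorithm many times and counting how often each variable appears among the best solutions.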

Fig. 7 Histogram showing relative frequency of occurrence of each data point within the 12000 best solutions obtained from the genetic algorithm (GA) repeats and, on the same horizontal scale, the mean of the raw NMR spectra.

Fig. 8 An expansion of a spectral region identified as important by the genetic algorithm feature selection. Spectra are shown offset for clarity. The lower part of the figure shows an expansion of the histogram in the same region. Main ‘features’ in the histogram are marked with the corresponding chemical shift.

Fig. 9 Data from three variables only (chemical shifts 0.848 ppm, 0.861 ppm and 0.755 ppm) taken from area-normalized and standardized datasets from the ‘training set’ volunteer (a), and additional volunteers (b)–(f). LDA, linear discriminant analysis; ICV, internal cross-validation; Hypothesis H0a, the trivariate data for each volunteer contain no group structure; Hypothesis H0b, all subsets of three peaks have the same discriminatory performance. For details, see Feature subset selection.

Fig. 10 Scores plots obtained from applying (a) principal component (PC) analysis and (b) partial least squares (PLS) to the spectral region shown in Fig. 8. ○, ‘pre-intervention’ samples; ■, ‘post-intervention’ samples. (c) Plots of the first and second PLS loadings obtained from the spectral region of Fig. 8, on the same horizontal axis as the mean spectra of the ‘pre-’ and ‘post-intervention’ groups. Loadings are shown ‘inverse-variance-scaled’.