
L2-Boosting algorithm applied to high-dimensional problems in genomic selection

Published online by Cambridge University Press:  29 July 2010

OSCAR GONZÁLEZ-RECIO*
Affiliation:
Departamento de Mejora Genética Animal, Instituto Nacional de Investigaciones Agrarias, Madrid 28040, Spain
KENT A. WEIGEL
Affiliation:
Department of Dairy Science, University of Wisconsin, Madison, WI 53706, USA
DANIEL GIANOLA
Affiliation:
Department of Dairy Science, University of Wisconsin, Madison, WI 53706, USA; Department of Animal Sciences, University of Wisconsin, Madison, WI 53706, USA; Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, P.O. Box 5003, N-1432, Ås, Norway
HUGO NAYA
Affiliation:
Unidad de Bioinformática, Instituto Pasteur de Montevideo, Montevideo 11400, Uruguay
GUILHERME J. M. ROSA
Affiliation:
Department of Dairy Science, University of Wisconsin, Madison, WI 53706, USA
*Corresponding author. Departamento Mejora Genética Animal, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria, Madrid 28040, Spain. Tel: +34913471486, e-mail: gonzalez.oscar@inia.es

Summary

The L2-Boosting algorithm is one of the most promising machine-learning techniques to have appeared in recent decades. It may be applied to high-dimensional problems such as whole-genome studies, and it is relatively simple from a computational point of view. In this study, we used this algorithm in a genomic selection context to make predictions of yet-to-be-observed outcomes. Two data sets were used: (1) productive lifetime predicted transmitting abilities from 4702 Holstein sires genotyped for 32 611 single nucleotide polymorphisms (SNPs) derived from the Illumina® BovineSNP50 BeadChip, and (2) progeny averages of food conversion rate, pre-corrected by environmental and mate effects, in 394 broilers genotyped for 3481 SNPs. Each of these data sets was split into training and testing sets, the latter comprising dairy or broiler sires whose ancestors were in the training set. Two weak learners, ordinary least squares (OLS) and non-parametric (NP) regression, were used for the L2-Boosting algorithm, to provide a stringent evaluation of the procedure. This algorithm was compared with BL [Bayesian LASSO (least absolute shrinkage and selection operator)] and BayesA regression. Learning tasks were carried out in the training set, whereas validation of the models was performed in the testing set. Pearson correlations between predicted and observed responses in the dairy cattle (broiler) data set were 0·65 (0·33), 0·53 (0·37), 0·66 (0·26) and 0·63 (0·27) for OLS-Boosting, NP-Boosting, BL and BayesA, respectively. The smallest bias and mean-squared errors (MSEs) were obtained with OLS-Boosting in both the dairy cattle (0·08 and 1·08, respectively) and broiler (−0·011 and 0·006, respectively) data sets. In the dairy cattle data set, the BL was more accurate (bias=0·10 and MSE=1·10) than BayesA (bias=1·26 and MSE=2·81), whereas no differences between these two methods were found in the broiler data set.
L2-Boosting with a suitable learner was found to be a competitive alternative for genomic selection applications, providing high accuracy and low bias in genomic-assisted evaluations with a relatively short computational time.
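To make the procedure concrete, the following is a minimal sketch of componentwise L2-Boosting with simple OLS base learners, of the kind the summary describes: at each iteration the single predictor (e.g. SNP) whose univariate least-squares fit best explains the current residuals is selected, a shrunken version of that fit is added to the ensemble, and early stopping is chosen by MSE on a tuning set. The function name, shrinkage value and stopping rule here are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def l2_boost(X, y, X_tune, y_tune, n_iter=200, nu=0.1):
    """Componentwise L2-Boosting with univariate OLS weak learners.

    Illustrative sketch (not the paper's exact code): each iteration
    fits one predictor to the current residuals, shrinks its slope by
    `nu`, and tracks MSE on a tuning set for early stopping.
    """
    n, p = X.shape
    coef = np.zeros(p)
    intercept = y.mean()
    resid = y - intercept
    tune_mse = []
    best_coef, best_mse = coef.copy(), np.inf
    for m in range(n_iter):
        # univariate OLS slope of every predictor against the residuals
        betas = X.T @ resid / (X ** 2).sum(axis=0)
        # select the predictor giving the smallest residual sum of squares
        rss = ((resid[:, None] - X * betas) ** 2).sum(axis=0)
        j = int(rss.argmin())
        coef[j] += nu * betas[j]
        resid -= nu * betas[j] * X[:, j]
        # early-stopping criterion: MSE in the tuning set
        mse = float(np.mean((y_tune - intercept - X_tune @ coef) ** 2))
        tune_mse.append(mse)
        if mse < best_mse:
            best_mse, best_coef = mse, coef.copy()
    return intercept, best_coef, tune_mse
```

The shrinkage factor `nu` and the number of iterations play the roles visible in Figs 2 and 3: tuning-set MSE typically decreases, reaches a minimum, and then rises as the ensemble begins to overfit.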

Information

Type
Paper
Copyright
Copyright © Cambridge University Press 2010

Fig. 1. Schematic diagram of the boosting framework. Each weak learner fm(y, X) is trained on a weighted form of the training set, namely the residuals from the previous weak learner, and is tested on a tuning set at each iteration. Once all weak learners have been trained, those from iteration 1 up to the iteration minimizing MSE in the tuning set are combined to provide the final estimates ŷ. Adapted from Bishop (2006).


Fig. 2. MSE at each iteration for OLS-BOOST and NP-BOOST in the dairy cattle tuning set in each of the 10 folds (grey line) and the averaged MSE across folds (black solid line).


Fig. 3. MSE at each iteration for OLS-BOOST and NP-BOOST in the broiler tuning set in each of the 10 folds (grey line) and the averaged MSE across folds (black solid line). For convenience, only the first 1000 iterations are shown.


Table 1. Pearson correlation between yet to be observed records and their predictions in each testing set for L2-Boosting with OLS regression or NP regression as weak learners, BayesA and BL


Table 2. Bias and MSE of predicted responses in each testing set for L2-Boosting with OLS regression or NP regression as weak learners, BayesA and BL
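The validation quantities reported in Tables 1 and 2 can be computed as below. This is an illustrative sketch: bias is taken here as the mean difference between observed and predicted responses, which is an assumption, since the paper reports regression intercept and slope estimates in Figs 4 and 5 and may define bias differently.

```python
import numpy as np

def validation_metrics(y_obs, y_pred):
    """Pearson correlation, bias and MSE between observed and
    predicted responses in a testing set (illustrative definitions)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_obs - y_pred
    r = float(np.corrcoef(y_obs, y_pred)[0, 1])  # Pearson correlation
    bias = float(err.mean())                     # mean prediction error
    mse = float(np.mean(err ** 2))               # mean-squared error
    return r, bias, mse
```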


Fig. 4. Relationship between 2008 progeny test PTA for productive lifetime (y) and 2003 genomic PTA (y_hat) in a testing set comprising 1398 Holstein bulls born from 1999 to 2002, using L2-Boosting with OLS regression or NP regression as weak learners, and with BayesA and BL. Intercept and slope estimates from the linear regressions are given.


Fig. 5. Relationship between observed average FCR, adjusted by environmental and mate effects (y), and the genomic predicted FCR (y_hat) in a testing set comprising 61 broilers, which were progeny of the 394 broilers in the training set, using L2-Boosting with OLS regression or NP regression as weak learners, and BayesA and BL. Intercept and slope estimates from the linear regressions are given.