L2-Boosting algorithm applied to high-dimensional problems in genomic selection

  • OSCAR GONZÁLEZ-RECIO (a1), KENT A. WEIGEL (a2), DANIEL GIANOLA (a2) (a3) (a4), HUGO NAYA (a5) and GUILHERME J. M. ROSA (a2)...

Summary

The L2-Boosting algorithm is one of the most promising machine-learning techniques to have appeared in recent decades. It can be applied to high-dimensional problems such as whole-genome studies, and it is relatively simple from a computational point of view. In this study, we used this algorithm in a genomic selection context to predict yet-to-be-observed outcomes. Two data sets were used: (1) productive lifetime predicted transmitting abilities from 4702 Holstein sires genotyped for 32 611 single nucleotide polymorphisms (SNPs) derived from the Illumina® BovineSNP50 BeadChip, and (2) progeny averages of food conversion rate, pre-corrected for environmental and mate effects, in 394 broilers genotyped for 3481 SNPs. Each data set was split into training and testing sets, the latter comprising dairy or broiler sires whose ancestors were in the training set. Two weak learners, ordinary least squares (OLS) and non-parametric (NP) regression, were used with the L2-Boosting algorithm to provide a stringent evaluation of the procedure. The algorithm was compared with the Bayesian LASSO (least absolute shrinkage and selection operator; BL) and BayesA regression. Learning tasks were carried out in the training set, whereas validation of the models was performed in the testing set. Pearson correlations between predicted and observed responses in the dairy cattle (broiler) data set were 0·65 (0·33), 0·53 (0·37), 0·66 (0·26) and 0·63 (0·27) for OLS-Boosting, NP-Boosting, BL and BayesA, respectively. The smallest bias and mean-squared error (MSE) were obtained with OLS-Boosting in both the dairy cattle (0·08 and 1·08, respectively) and broiler (−0·011 and 0·006) data sets. In the dairy cattle data set, BL was more accurate (bias=0·10 and MSE=1·10) than BayesA (bias=1·26 and MSE=2·81), whereas no differences between these two methods were found in the broiler data set.
L2-Boosting with a suitable learner was found to be a competitive alternative for genomic selection applications, providing high accuracy and low bias in genomic-assisted evaluations with a relatively short computational time.
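To illustrate the procedure the Summary describes, the sketch below implements generic componentwise L2-Boosting with an OLS weak learner (in the style of Bühlmann & Yu, 2003): at each iteration the current residuals are regressed on every predictor separately, the single predictor giving the largest squared-error reduction is kept, and a small shrunken step is taken along that fit. This is a minimal didactic sketch, not the authors' implementation; the function name, the step length `nu` and the toy data are illustrative assumptions.

```python
import numpy as np

def l2_boost(X, y, n_iter=200, nu=0.1):
    """Componentwise L2-Boosting with an OLS weak learner.

    X : (n, p) matrix of centred covariates (e.g. SNP codes)
    y : (n,) response vector
    nu : shrinkage (step length) applied to each weak-learner fit
    """
    n, p = X.shape
    intercept = y.mean()
    coef = np.zeros(p)
    resid = y - intercept
    ss = (X ** 2).sum(axis=0)          # per-column sums of squares
    for _ in range(n_iter):
        # OLS slope of the residuals on each column, fitted separately
        b = X.T @ resid / ss
        # squared-error reduction achieved by each single-predictor fit
        gain = b ** 2 * ss
        j = int(np.argmax(gain))       # best-fitting predictor this round
        coef[j] += nu * b[j]           # shrunken update of its coefficient
        resid -= nu * b[j] * X[:, j]   # refresh residuals for next round
    return intercept, coef

# Toy usage: the response depends on 2 of 50 "marker" covariates
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
X -= X.mean(axis=0)                    # centre columns
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.1 * rng.standard_normal(100)
mu, beta = l2_boost(X, y)
pred = mu + X @ beta
```

Because each iteration fits only one predictor and shrinks the update, the procedure handles far more covariates than observations, which is what makes it attractive for whole-genome data.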


Corresponding author

*Corresponding author. Departamento Mejora Genética Animal, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria, Madrid 28040, Spain. Tel: +34913471486, e-mail: gonzalez.oscar@inia.es
