
Data pre-processing to improve the mining of large feed databases

  • F. Maroto-Molina (a1), A. Gómez-Cabrera (a2), J. E. Guerrero-Ginel (a2), A. Garrido-Varo (a2), D. Sauvant (a3), G. Tran (a4), V. Heuzé (a4) and D. C. Pérez-Marín (a2)...


The information stored in animal feed databases is highly variable, in terms of both provenance and quality; therefore, data pre-processing is essential to ensure reliable results. Yet, pre-processing at best tends to be unsystematic; at worst, it may even be wholly ignored. This paper sought to develop a systematic approach to the various stages involved in pre-processing to improve feed database outputs. The database used contained analytical and nutritional data on roughly 20 000 alfalfa samples. A range of techniques were examined for integrating data from different sources, for detecting duplicates and, particularly, for detecting outliers. Special attention was paid to the comparison of univariate and multivariate solutions. Major issues relating to the heterogeneous nature of data contained in this database were explored, the observed outliers were characterized and ad hoc routines were designed for error control. Finally, a heuristic diagram was designed to systematize the various aspects involved in the detection and management of outliers and errors.
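As an illustration of why the comparison of univariate and multivariate solutions matters, the following minimal sketch contrasts a per-variable z-score rule with a Mahalanobis distance rule on two correlated variables. It is not the authors' routine. The data are synthetic, and the variable names (`crude_protein`, `ndf`), the z-score threshold of 3 and the chi-square cut-off at alpha = 0.025 are all assumptions chosen for the example.

```python
import math
from statistics import mean, stdev

# Synthetic, hypothetical data (NOT from the paper): 20 samples in which
# fibre falls as protein rises, plus one sample (index 20) that breaks
# this relationship while staying inside the range of each variable.
crude_protein = [14 + 0.5 * i for i in range(20)] + [22.0]
ndf = [40 - 0.5 * i + (0.3 if i % 2 == 0 else -0.3) for i in range(20)] + [38.0]


def zscore_outliers(values, threshold=3.0):
    """Univariate rule: flag indices whose |z-score| exceeds the threshold."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) / s > threshold]


def mahalanobis_outliers(xs, ys, alpha=0.025):
    """Bivariate rule: flag indices whose squared Mahalanobis distance
    exceeds the (1 - alpha) chi-square quantile with 2 degrees of freedom."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    # For 2 degrees of freedom the chi-square quantile has a closed form.
    cutoff = -2.0 * math.log(alpha)
    flagged = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        dx, dy = x - mx, y - my
        # Squared distance using the explicit inverse of the 2x2 covariance.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        if d2 > cutoff:
            flagged.append(i)
    return flagged


univariate_flags = sorted(
    set(zscore_outliers(crude_protein)) | set(zscore_outliers(ndf))
)
multivariate_flags = mahalanobis_outliers(crude_protein, ndf)
print("univariate:", univariate_flags)
print("multivariate:", multivariate_flags)
```

In this constructed case the last sample has unremarkable values on each variable taken alone, so the univariate rule does not flag it, while the bivariate rule does because the combination contradicts the correlation seen in the other samples. With real feed data, classical mean and covariance estimates can themselves be distorted by outliers, so a robust estimator may be preferable.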




