
Data preprocessing in predictive data mining

Stamatios-Aggelos N. Alexandropoulos, Sotiris B. Kotsiantis and Michael N. Vrahatis


Many factors influence the success of data mining on a given problem. Two of the most important are the representation and the quality of the dataset. Specifically, if the data contain much redundant, irrelevant, noisy, or unreliable information, knowledge discovery becomes very difficult. It is well known that data preparation steps require significant processing time in machine learning tasks. It would be very useful to have preprocessing algorithms that perform reliably and effectively across all datasets, but this is impossible. We therefore present the most well-known and widely used up-to-date algorithms for each step of data preprocessing in the framework of predictive data mining.



