Data preprocessing in predictive data mining

Stamatios-Aggelos N. Alexandropoulos; Sotiris B. Kotsiantis; Michael N. Vrahatis

doi:10.1017/S026988891800036X

Published online by Cambridge University Press: 09 January 2019

Affiliation (all authors): Computational Intelligence Laboratory (CILab), Department of Mathematics, University of Patras, GR-26110 Patras, Greece
E-mail: alekst@math.upatras.gr, sotos@math.upatras.gr, vrahatis@math.upatras.gr

Abstract

Many factors influence the success of data mining on a given problem. Two of the most important are the representation and the quality of the dataset. Specifically, if the data contain much redundant, irrelevant, noisy or unreliable information, knowledge discovery becomes very difficult. It is well known that data preparation steps require significant processing time in machine learning tasks. It would be very helpful if there were preprocessing algorithms that performed reliably and effectively across all datasets, but this is impossible. To this end, we present the most well-known and widely used up-to-date algorithms for each step of data preprocessing in the framework of predictive data mining.

Information

Type: Survey Article
Information: The Knowledge Engineering Review, Volume 34, 2019, e1

DOI: https://doi.org/10.1017/S026988891800036X
Copyright: © Cambridge University Press, 2019
