Data preprocessing in predictive data mining

Published online by Cambridge University Press:  09 January 2019

Stamatios-Aggelos N. Alexandropoulos
Affiliation:
Computational Intelligence Laboratory (CILab), Department of Mathematics, University of Patras, GR-26110 Patras, Greece e-mail: alekst@math.upatras.gr, sotos@math.upatras.gr, vrahatis@math.upatras.gr
Sotiris B. Kotsiantis
Affiliation:
Computational Intelligence Laboratory (CILab), Department of Mathematics, University of Patras, GR-26110 Patras, Greece e-mail: alekst@math.upatras.gr, sotos@math.upatras.gr, vrahatis@math.upatras.gr
Michael N. Vrahatis
Affiliation:
Computational Intelligence Laboratory (CILab), Department of Mathematics, University of Patras, GR-26110 Patras, Greece e-mail: alekst@math.upatras.gr, sotos@math.upatras.gr, vrahatis@math.upatras.gr

Abstract

Many issues influence the success of data mining on a given problem. Two of the most important are the representation and the quality of the dataset. Specifically, if the data contain much redundant, irrelevant, noisy, or unreliable information, knowledge discovery becomes very difficult. It is well known that data preparation steps require significant processing time in machine learning tasks. It would be very useful to have preprocessing algorithms that perform reliably and effectively across all datasets, but this is impossible. We therefore present the most well-known and widely used up-to-date algorithms for each step of data preprocessing in the framework of predictive data mining.

Type
Survey Article
Copyright
© Cambridge University Press, 2019 
