Scaling up classification rule induction through parallel processing

  • Frederic Stahl and Max Bramer
Abstract

The rapid increase in the size and number of databases demands data mining approaches that scale to large amounts of data. This has led to the exploration of parallel computing technologies that perform data mining tasks concurrently on several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.

The Knowledge Engineering Review
  • ISSN: 0269-8889
  • EISSN: 1469-8005
  • URL: /core/journals/knowledge-engineering-review