A survey of data mining and knowledge discovery process models and methodologies

Published online by Cambridge University Press: 01 June 2010

Gonzalo Mariscal*
Affiliation:
Universidad Europea de Madrid, C/Tajo, S/N. 28670 - Villaviciosa de Odon, Madrid, Spain
Óscar Marbán*
Affiliation:
Facultad de Informatica, Universidad Politecnica de Madrid, Campus de Montegancedo, 28660 - Boadilla del Monte, Madrid, Spain
Covadonga Fernández*
Affiliation:
Facultad de Informatica, Universidad Politecnica de Madrid, Campus de Montegancedo, 28660 - Boadilla del Monte, Madrid, Spain

Abstract

Many data mining and knowledge discovery methodologies and process models have been developed to date, with varying degrees of success. In this paper, we describe the most used (in industrial and academic projects) and most cited (in the scientific literature) data mining and knowledge discovery methodologies and process models, providing an overview of their evolution over the history of data mining and knowledge discovery and setting out the state of the art in this area. For each approach, we provide a brief description of the proposed knowledge discovery in databases (KDD) process and discuss its special features and its main advantages and disadvantages. In addition, we provide an overall comparison of all the data mining approaches presented, focusing on the different steps and tasks through which each approach interprets the whole KDD process. As a result of this comparison, we propose a new data mining and knowledge discovery process, named the refined data mining process, for developing any kind of data mining and knowledge discovery project. The refined data mining process is built on specific steps taken from the analyzed approaches.

Type
Articles
Copyright
Copyright © Cambridge University Press 2010

