  • Print publication year: 2011
  • Online publication date: June 2012

1 - Data Mining


In this introductory chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute to this field. We cover “Bonferroni's Principle,” which is really a warning about overusing the ability to mine data. This chapter is also the place where we summarize a few ideas that are not themselves data mining but are useful in understanding some important data-mining concepts. These include the TF.IDF measure of word importance, the behavior of hash functions and indexes, and identities involving e, the base of natural logarithms. Finally, we give an outline of the topics covered in the balance of the book.

What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of “models” for data. A “model,” however, can be one of several things. We mention below the most important directions in modeling.

Statistical Modeling

Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Section 1.2 illustrates the sort of errors one can make by trying to extract what really isn't in the data. Today, “data mining” has taken on a positive meaning. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.
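
To make the statistical view concrete, here is a minimal sketch of constructing a model as an underlying distribution. It assumes, purely for illustration, that the visible data is drawn from a Gaussian, so the model reduces to two numbers: a mean and a standard deviation. The function name `fit_gaussian` and the synthetic data are hypothetical and do not come from the text.

```python
import math
import random


def fit_gaussian(samples):
    """Estimate the mean and standard deviation of a Gaussian
    assumed (for illustration) to have generated `samples`."""
    n = len(samples)
    mean = sum(samples) / n
    # Maximum-likelihood estimate of the variance (divides by n).
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(variance)


# Synthetic "visible data": 10,000 draws from a Gaussian with
# mean 10 and standard deviation 2 (assumed values for the demo).
random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(10_000)]

mean, std = fit_gaussian(data)
print(f"estimated mean = {mean:.2f}, estimated std = {std:.2f}")
```

The two estimated parameters summarize the whole dataset under the Gaussian assumption; if the data were not actually Gaussian, the fitted model would still be produced but could be a poor description of it.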

Mining of Massive Datasets
  • Online ISBN: 9781139058452