  • Print publication year: 2011
  • Online publication date: June 2012

1 - Data Mining


In this introductory chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute to this field. We cover “Bonferroni's Principle,” which is really a warning about overusing the ability to mine data. This chapter is also the place where we summarize a few ideas that are not themselves data mining but are useful in understanding some important data-mining concepts. These include the TF.IDF measure of word importance, the behavior of hash functions and indexes, and identities involving e, the base of natural logarithms. Finally, we give an outline of the topics covered in the balance of the book.

What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of “models” for data. A “model,” however, can be one of several things. We mention below the most important directions in modeling.

Statistical Modeling

Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Section 1.2 illustrates the sort of errors one can make by trying to extract what really isn't in the data. Today, “data mining” has taken on a positive meaning. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.
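
As a rough illustration of "constructing a statistical model," the sketch below assumes that the visible data is a list of numbers drawn from a Gaussian distribution and estimates the most likely mean and standard deviation of that distribution. The Gaussian assumption, the function name fit_gaussian, and the synthetic sample are all hypothetical choices made for this example; they are not taken from the chapter.

```python
import math
import random


def fit_gaussian(data):
    """Return the maximum-likelihood mean and standard deviation of data.

    Assumes (for illustration only) that data was drawn from a Gaussian.
    """
    n = len(data)
    mean = sum(data) / n
    # The maximum-likelihood variance divides by n, not n - 1.
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, math.sqrt(variance)


# Synthetic "visible data": 10,000 draws from a Gaussian with mean 10, std 2.
random.seed(0)
sample = [random.gauss(10.0, 2.0) for _ in range(10_000)]

mean, std = fit_gaussian(sample)
print(f"estimated mean={mean:.2f}, estimated std={std:.2f}")
```

Under this view, the two estimated parameters are the model: they summarize the entire sample as a distribution from which the data could plausibly have been drawn.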

Mining of Massive Datasets
  • Online ISBN: 9781139058452