Pattern Recognition and Machine Learning

Summary

Chapter Objectives

✓ To discuss the major issues of relational databases

✓ To understand the need for NoSQL

✓ To comprehend the characteristics of NoSQL

✓ To understand different data models of NoSQL

✓ To understand the concept of the CAP theorem

✓ To discuss the future of NoSQL

After about half a century of dominance by relational databases, the current excitement about NoSQL databases comes as a big surprise. In this chapter, we'll explore the challenges that changing technological paradigms pose for relational databases and why the current rise of NoSQL databases is not a flash in the pan.

Let us start our discussion by looking at relational databases.

The Rise of Relational Databases

Dr E. F. Codd proposed the relational model in 1969. The mainstream software industry soon adopted it for its simplicity and efficiency, and it replaced the hierarchical and network models that were prevalent at the time. The timeline showing the rise of the relational model is depicted in Figure 15.1.

The reasons for the success of relational databases were their simplicity, the power of SQL, support for transaction management, concurrency control, and recovery management.

Major Issues with Relational Databases

The relational data model organizes data in rows and columns that are arranged in a tabular form. In the relational model, a row is known as a tuple which is a set of key-value pairs and a relation is a set of these tuples. All operations in SQL consume and return relations. This foundation based on relations provides a certain elegance and simplicity, but it also suffers some limitations. The values in a relational tuple have to be simple (atomic)—they cannot contain any structure, such as a nested record or a list.

This limitation does not apply to in-memory data structures, which can be much richer than relations. As a result, to use a richer in-memory data structure, you have to translate it into a relational representation before storing it on disk. This problem is known as impedance mismatch, i.e. two different representations that require translation between them, as shown in Figure 15.2.
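The translation between the two representations can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the order record, the table names, and the column names are hypothetical and are not taken from the chapter.

```python
import sqlite3

# Hypothetical in-memory record: a nested structure holding a list of items.
order = {
    "id": 1,
    "customer": "Asha",
    "items": [
        {"product": "pen", "qty": 2},
        {"product": "notebook", "qty": 1},
    ],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute(
    "CREATE TABLE order_items (order_id INTEGER, product TEXT, qty INTEGER)"
)

# Storing: the nested object is split into flat tuples of atomic values.
conn.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], order["customer"]))
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(order["id"], item["product"], item["qty"]) for item in order["items"]],
)

# Loading: the flat rows are read back and reassembled into the nested shape.
oid, customer = conn.execute("SELECT id, customer FROM orders").fetchone()
items = [
    {"product": product, "qty": qty}
    for product, qty in conn.execute(
        "SELECT product, qty FROM order_items WHERE order_id = ? ORDER BY rowid",
        (oid,),
    )
]
rebuilt = {"id": oid, "customer": customer, "items": items}
item_rows = conn.execute("SELECT COUNT(*) FROM order_items").fetchone()[0]

print(rebuilt == order)
```

One nested object becomes one row in orders plus one row per item in order_items, and the application code has to perform this splitting and reassembly on every save and load.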

The impedance mismatch is a major source of frustration for application developers. In the 1990s many experts believed that impedance mismatch would lead to relational databases being replaced with databases that replicate the in-memory data structures to disk.

Summary

Chapter Objectives

✓ To demonstrate the use of the association mining algorithm.

✓ To apply association mining on numeric data

✓ To comprehend the use of class association rules

✓ To compare the decision tree classifier with association mining

✓ To conduct association mining with R language

Association Mining with Weka

Let us consider the ‘to-play-or-not-to-play’ dataset given in Figure 10.1 to get hands-on experience with association mining in Weka. This dataset is available as a default dataset in the data folder of Weka with the file name weather.nominal.arff.

This dataset has four attributes describing weather conditions and a fifth, class attribute that indicates whether play was held on a day with those weather conditions. There are 14 instances, or samples, in this dataset.

It is important to note that in classification, we are interested in assigning the output attribute to play or no play. In association mining, however, we are interested in finding association rules based on the associations between all the attributes that occur together. Thus, association mining gives the class attribute no special role.

If we compare this dataset with the transactions dataset discussed in the last chapter for market basket analysis, we can find an equivalence with transaction ids and the data items purchased in each transaction.

Here, the instances numbered 1 to 14 act as transaction ids, and the attribute values in the row for a given instance act as the data items of that transaction. We are interested in finding associations by observing facts such as Outlook = sunny AND Temperature = hot occurring together more often than Outlook = sunny AND Temperature = cool, as shown in Figure 10.2.
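The idea of treating each instance as a transaction can be sketched in a few lines of Python. The rows below are the Outlook and Temperature values of the 14 instances, transcribed here on the assumption that they match the standard weather.nominal.arff file; the helper names are hypothetical, and this is only a counting sketch, not Weka's implementation.

```python
# (Outlook, Temperature) for instances 1 to 14.
# Assumption: transcribed from the standard weather.nominal data.
rows = [
    ("sunny", "hot"), ("sunny", "hot"), ("overcast", "hot"), ("rainy", "mild"),
    ("rainy", "cool"), ("rainy", "cool"), ("overcast", "cool"), ("sunny", "mild"),
    ("sunny", "cool"), ("rainy", "mild"), ("sunny", "mild"), ("overcast", "mild"),
    ("overcast", "hot"), ("rainy", "mild"),
]

# Each instance becomes a transaction whose items are attribute=value pairs.
transactions = [
    frozenset({f"Outlook={outlook}", f"Temperature={temperature}"})
    for outlook, temperature in rows
]


def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


sunny_hot = support_count({"Outlook=sunny", "Temperature=hot"})
sunny_cool = support_count({"Outlook=sunny", "Temperature=cool"})

print("sunny & hot :", sunny_hot, "of", len(transactions))
print("sunny & cool:", sunny_cool, "of", len(transactions))
```

Dividing each count by the number of transactions gives the support of the itemset, the quantity that association algorithms use to decide which combinations are frequent.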

Weka contains an Associate tab which aids in applying different association algorithms in order to find association rules from datasets. One such algorithm is the Predictive Apriori association algorithm that optimally combines support and confidence to calculate a value called predictive accuracy as depicted in Figure 10.3.

The user only needs to specify how many rules they would like the algorithm to generate, and the algorithm takes care of optimizing support and confidence to find the best rules.

Summary

Chapter Objectives

✓ To understand what is meant by web mining and its types

✓ To understand the working of the HITS algorithm

✓ To know the brief history of search engines

✓ To understand a search engine's architecture and its working

✓ To understand the PageRank algorithm and its working

✓ To understand the concepts of precision and recall

Introduction

Since Berners-Lee (inventor of the World Wide Web) created the first web page in 1991, there has been an exponential growth in the number of websites worldwide. As of 2018, there were 1.8 billion websites in the world. This growth has been accompanied by another exponential increase in the amount of data available and in the need to organize this data in order to extract useful information from it.

Early attempts to organize such data included the creation of web directories to group together similar web pages. The web pages in these directories were often manually reviewed and tagged based on keywords. Over time, search engines became available that employed a variety of techniques to extract the required information from web pages. These techniques are called web mining. Formally, web mining is the application of data mining techniques and machine learning to find useful information from the data present in web pages.

Web mining is divided into three parts, i.e. web content mining, structure mining, and usage mining as shown in Figure 11.1.

We will discuss each type of web mining in brief.

Web Content Mining

Web content mining deals with extracting relevant knowledge from the contents of a web page. During content mining, we totally ignore how other web pages link to a given web page or how users interact with it. A trivial approach to web content mining is based on location and frequency of keywords. But this gives rise to two problems: first, the problem of scarcity and second, the problem of abundance. The problem of scarcity occurs with those queries that either generate a few results or no results at all. The problem of abundance occurs with the queries that generate too many search results. The root cause of both the problems is the nature of data present on the web. The data is usually present in the form of HTML which is semi-structured and useful information is generally scattered across multiple web pages.
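The trivial keyword approach can be sketched as below. This is a rough illustration only: the three pages, the weight given to title hits, and the function name are all hypothetical, and the tag stripping is deliberately crude.

```python
import re


def keyword_score(html, keyword, title_weight=3):
    """Score a page by keyword frequency, weighting hits in the title more."""
    kw = keyword.lower()
    title_match = re.search(r"<title>(.*?)</title>", html, flags=re.I | re.S)
    title = title_match.group(1) if title_match else ""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
    words = re.findall(r"[a-z0-9]+", text.lower())
    title_words = re.findall(r"[a-z0-9]+", title.lower())
    # Title words are already counted once in `words`, so add the extra weight.
    return words.count(kw) + (title_weight - 1) * title_words.count(kw)


# Hypothetical pages used only for illustration.
pages = {
    "a.html": "<html><title>Data mining basics</title>"
              "<body>Mining data with Weka.</body></html>",
    "b.html": "<html><title>Gardening</title>"
              "<body>Coal mining towns and mining history.</body></html>",
    "c.html": "<html><title>Cooking</title><body>Recipes for soup.</body></html>",
}

scores = {name: keyword_score(html, "mining") for name, html in pages.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Page b.html matches the keyword as often in its body as a.html does, although it is about a different kind of mining. A scheme based only on keyword location and frequency has no means of telling the two apart, which is one way the problem of abundance arises.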

Summary

In the modern age of artificial intelligence and business analytics, data is considered the oil of the cyber world. The mining of data has huge potential to improve business outcomes, and there is a growing demand for data mining experts to carry it out. This book aims to train learners to fill this gap.

This book will give learners sufficient information to acquire mastery over the subject. It covers the practical aspects of data mining, data warehousing, and machine learning in a simplified manner without compromising on the details of the subject. The main strength of the book is the illustration of concepts with practical examples so that learners can grasp the contents easily. Another important feature of the book is the illustration of data mining algorithms with practical hands-on sessions on Weka and the R language (a major data mining tool and language, respectively). In this book, every concept has been illustrated through a step-by-step approach in tutorial form for self-practice in Weka and R. This textbook includes many pedagogical features, such as chapter-wise summaries, exercises including probable problems, a question bank, and relevant references, to provide sound knowledge to learners. It gives students a platform to gain expertise in the technology and so improve their placement prospects.

Video sessions on data mining, machine learning, big data and DBMS are also available on my YouTube channel. Learners are requested to subscribe to this channel https://www.youtube.com/user/parteekbhatia to get the latest updates through video sessions on these topics.

Your suggestions for further improvements to the book are always welcome. Kindly e-mail your suggestions to parteek.bhatia@gmail.com.

I hope you enjoy learning from this book as much as I enjoyed writing it.

Summary

Chapter Objectives

✓ To comprehend the concept of clustering, its applications, and features.

✓ To understand various distance metrics for clustering of data.

✓ To comprehend the process of K-means clustering.

✓ To comprehend the process of hierarchical clustering algorithms.

✓ To comprehend the process of DBSCAN algorithms.

Introduction to Cluster Analysis

Generally, in the case of large datasets, data is not labeled because labeling a large number of records requires a great deal of human effort. The unlabeled data can be analyzed with the help of clustering techniques. Clustering is an unsupervised learning technique which does not require a labeled dataset.

Clustering is defined as grouping a set of similar objects into classes or clusters. In other words, during cluster analysis, the data is grouped into classes or clusters, so that records within a cluster (intra-cluster) have high similarity with one another but have high dissimilarities in comparison to objects in other clusters (inter-cluster), as shown in Figure 7.1.
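The contrast between intra-cluster similarity and inter-cluster dissimilarity can be checked numerically. The sketch below uses two small hypothetical groups of 2-D points and Euclidean distance; the points and function names are illustrative assumptions.

```python
from itertools import combinations, product
from math import dist
from statistics import mean

# Two hypothetical clusters of 2-D records.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)]
cluster_b = [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]


def mean_intra_distance(cluster):
    """Average distance between pairs of records in the same cluster."""
    return mean(dist(p, q) for p, q in combinations(cluster, 2))


def mean_inter_distance(first, second):
    """Average distance between records drawn from two different clusters."""
    return mean(dist(p, q) for p, q in product(first, second))


intra_a = mean_intra_distance(cluster_a)
intra_b = mean_intra_distance(cluster_b)
inter_ab = mean_inter_distance(cluster_a, cluster_b)

print(f"intra A: {intra_a:.2f}, intra B: {intra_b:.2f}, inter A-B: {inter_ab:.2f}")
```

A good clustering keeps the intra-cluster distances small relative to the inter-cluster distance, as happens here.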

The similarity of records is identified on the basis of the values of the attributes describing the objects. Cluster analysis is an important human activity. The first human beings, Adam and Eve, actually learned through the process of clustering. They did not know the name of any object; they simply observed each and every object. Based on the similarity of their properties, they identified these objects in groups or clusters. For example, one group or cluster was named trees, another fruits, and so on. They further classified the fruits on the basis of properties such as size, colour, shape, and taste. After that, people assigned labels or names to these objects, calling them mango, banana, orange, and so on. Finally, all objects were labeled. Thus, we can say that the first human beings used clustering for their learning and made clusters or groups of physical objects based on the similarity of their attributes.

Applications of Cluster Analysis

Cluster analysis has been widely used in various important applications such as:

Summary

Chapter Objectives

✓ To demonstrate the use of the decision tree

✓ To apply the decision tree on a sample dataset

✓ To implement a decision tree process using Weka and R

Building a Decision Tree Classifier in Weka

In this chapter, we will learn how Weka's decision tree feature helps to classify unknown samples of a dataset based on its attribute values. When Weka's decision tree is applied to an unknown sample, the decision tree classifies the sample into different classes such as Class A, Class B and Class C as shown in Figure 6.1.

For example, suppose we want to predict the class of an unknown sample of a flower based on the length and width of its Sepal and Petal. The first step would be to measure the Sepal length and width and the Petal length and width of the unknown flower and compare these dimensions to the values of the samples in our dataset of known species. The decision tree algorithm of Weka will help in creating decision rules to predict the class of the unknown flower automatically, as shown in Figure 6.2.

As shown in Figure 6.2, the dimensions of an unknown flower sample will be matched with the rules generated by the decision tree. First, the rules will be matched to determine whether the sample belongs to the Setosa class; if it does, the unknown sample will be classified as Setosa. If not, the unknown sample will be checked against the Virginica class. If it matches the conditions of the Virginica class, it will be labeled as Virginica, otherwise as Versicolor. It is important to note that it would not be simple to create these rules on the basis of the values of a single attribute, as shown in Table 6.1. It is clear that for the same Sepal width, the flower may be Setosa, Versicolor, or Virginica, making it unclear which species an unknown flower belongs to on the basis of Sepal width alone. Thus, the decision tree must make its prediction based on all four flower dimensions.
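The order of checks described for Figure 6.2 (Setosa first, then Virginica, otherwise Versicolor) can be sketched as a cascade of rules. The thresholds below are hypothetical values chosen for illustration; they are not the rules Weka generates, and this simplified cascade happens to test only the two Petal measurements even though it receives all four dimensions.

```python
def classify_iris(sepal_length, sepal_width, petal_length, petal_width):
    """Classify a flower with a rule cascade (illustrative thresholds only)."""
    # Rule 1: is the sample Setosa? (hypothetical threshold)
    if petal_length < 2.5:
        return "Setosa"
    # Rule 2: otherwise, is it Virginica? (hypothetical threshold)
    if petal_width > 1.7:
        return "Virginica"
    # Rule 3: everything else is labeled Versicolor.
    return "Versicolor"


# Example measurements in centimetres, used here as unknown samples.
samples = [
    (5.1, 3.5, 1.4, 0.2),
    (6.3, 3.3, 6.0, 2.5),
    (7.0, 3.2, 4.7, 1.4),
]
predictions = [classify_iris(*s) for s in samples]
print(predictions)
```

A tree learned from data has the same shape, but the algorithm chooses its attributes and thresholds from the training samples.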

Due to such overlaps, the decision tree cannot predict with 100% accuracy the class of flower, but can only determine the likelihood of an unknown sample belonging to a particular class. In real situations the decision tree algorithm works on the basis of probability.

Summary

Chapter Objectives

✓ To learn about the concepts of data mining.

✓ To understand the need for, and the applications of data mining

✓ To differentiate between data mining and machine learning

✓ To understand the process of data mining.

✓ To understand the difference between data mining and machine learning.

Introduction to Data Mining

In the age of information, an enormous amount of data is available in different industries and organizations. This massive volume of data is of no use unless it is transformed into valuable information. Otherwise, we are drowning in data but starving for knowledge. The solution to this problem is data mining, which is the extraction of useful information from the huge amount of data that is available.

Data mining is defined as follows:

‘Data mining is a collection of techniques for efficient automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so they may be used in an enterprise's decision making.’

From this definition, the important takeaways are:

Summary

Chapter Objectives

✓ To understand the concept of dimension, measure and the fact table

✓ To be able to apply different schemas of data warehouse design, such as Star Schema, Snowflake Schema and Fact Constellation Schema, to real-world applications.

✓ To understand the differences between these schemas, and their strengths and weaknesses.

Introduction to Data Warehouse Schema

The logical description of a database is known as its schema. It is the blueprint of the entire database: it defines how the data are organized and how the relations among them are associated. A data warehouse schema consists of the names and descriptions of records, including associated data items and aggregates. A database uses relational models whereas a data warehouse uses different types of schema, namely, Star, Snowflake, and Fact Constellation.

To start discussion on these schemas, it is important to understand the basic terminology used in this process, which is discussed below.

Dimension

The term ‘dimension’ in data warehousing refers to a collection of reference information about a measurable event. These events are stored in a fact table and are known as facts. The dimensions are generally the entities for which an organization wants to preserve records. A data warehouse organizes the descriptive attributes as columns in dimension tables. For example, a student dimension could have attributes such as first and last name, roll number, age, and gender, while an address dimension would include street name, state, and country attributes.

A dimension table contains a primary key column that uniquely identifies each record (row) of the dimension. A dimension is a framework that consists of one or more hierarchies that classify data. Usually dimensions are de-normalized tables and may have redundant data.

Let us take a quick recap of the concepts of normalization and de-normalization, as they will be used in this chapter. Normalization is the process of breaking up a larger table into smaller tables that are free of insertion, update, and deletion anomalies. Normalized tables have reduced redundancy of data. In order to get full information, these tables are usually joined.

In de-normalization, smaller tables are merged to form larger tables to reduce joining operations. De-normalization is particularly performed in those cases where retrieval is a major requirement and insert, update, and delete operations are minimal, as in case of historical data or data warehouse. These de-normalized tables will have redundancy of data.
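The trade-off between the two forms can be sketched with Python's built-in sqlite3 module. The student and city tables, their columns, and the sample rows are hypothetical and serve only to make the redundancy visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized form: city details are stored once and referenced by key.
    CREATE TABLE city (city_id INTEGER PRIMARY KEY, city TEXT, state TEXT);
    CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
    INSERT INTO city VALUES (1, 'Patiala', 'Punjab');
    INSERT INTO student VALUES (101, 'Asha', 1);
    INSERT INTO student VALUES (102, 'Ravi', 1);

    -- De-normalized form: the smaller tables are merged into one wide table.
    CREATE TABLE student_dim AS
        SELECT s.roll_no, s.name, c.city, c.state
        FROM student s JOIN city c ON s.city_id = c.city_id;
    """
)

# Retrieval from the normalized tables needs a join.
joined = conn.execute(
    "SELECT s.name, c.state FROM student s "
    "JOIN city c ON s.city_id = c.city_id ORDER BY s.roll_no"
).fetchall()

# Retrieval from the de-normalized table reads a single table.
flat = conn.execute(
    "SELECT name, state FROM student_dim ORDER BY roll_no"
).fetchall()

# The state value is stored once when normalized, once per student otherwise.
copies_normalized = conn.execute(
    "SELECT COUNT(*) FROM city WHERE state = 'Punjab'"
).fetchone()[0]
copies_denormalized = conn.execute(
    "SELECT COUNT(*) FROM student_dim WHERE state = 'Punjab'"
).fetchone()[0]

print(joined == flat, copies_normalized, copies_denormalized)
```

Both queries return the same rows, but the de-normalized table avoids the join at the cost of repeating the city and state for every student.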

Summary

Chapter Objectives

✓ To understand the concept of machine learning and its applications.

✓ To understand what supervised and unsupervised machine learning strategies are.

✓ To understand the concept of regression and classification.

✓ To identify the strategy to be applied to a given problem.

Introduction to Machine Learning

Machine Learning (ML) has emerged as the most extensively used tool for websites to classify surfers and address them appropriately. When we surf the Net, we are exposed to machine learning algorithms multiple times a day, often without realizing it. Machine learning is used by search engines such as Google and Bing to rank web pages or to decide which advertisement to show to which user. It is used by social networks such as Facebook and Instagram to generate a custom feed for every user or to tag users in uploaded pictures. It is also used by banks to detect whether an online transaction is genuine or fraudulent and by e-commerce websites such as Amazon and Flipkart to recommend products that we are most likely to buy. Even email providers such as Gmail, Yahoo, and Hotmail use machine learning to decide which emails are spam and which are not. These are only a few examples of applications of machine learning.

The ultimate aim of machine learning is to build an Artificial Intelligence (AI) platform that is as intelligent as the human mind. We are not very far from this dream and many AI researchers believe that this goal can be achieved through machine learning algorithms that try to mimic the learning processes of a human brain.

Actually, ML is a branch of AI. Many years ago, researchers tried to build intelligent programs with pre-defined rules, as in a normal program. But this approach did not work as there were too many special cases to be considered. For instance, we can define rules to find the shortest path between two points. But it is very difficult to make rules for programs such as photo tagging, classifying emails as spam or not spam, and web page ranking. The only solution to accomplish these tasks was to write a program that could generate its own rules by examining some examples (also called training data). This approach was named Machine Learning. This book will cover state-of-the-art machine learning algorithms and their deployment.
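A program that generates its own rule from examples can be very small. The sketch below learns a single threshold from labeled training data instead of having it written in by hand; the feature (for instance, the number of exclamation marks in an email), the data, and the function name are hypothetical.

```python
def learn_threshold(examples):
    """Pick the threshold t with the fewest training errors for 'x > t'."""
    values = sorted({x for x, _ in examples})
    # Candidate thresholds lie midway between neighbouring feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((x > t) != label for x, label in examples)

    return min(candidates, key=errors)


# Hypothetical training data: (feature value, is_spam).
training = [(0, False), (1, False), (2, False), (6, True), (7, True), (9, True)]

threshold = learn_threshold(training)


def predict(x):
    """Apply the learned rule to a new, unseen example."""
    return x > threshold


print(threshold, predict(8), predict(1))
```

The rule is not written by the programmer; it comes from the training data, so different examples would yield a different threshold.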

Summary

Chapter Objectives

✓ To learn to install Weka and the R language

✓ To demonstrate the use of Weka software

✓ To experiment with Weka on the Iris dataset

✓ To introduce basics of R language

✓ To experiment with R on the Iris dataset

About Weka

In this book, all data mining algorithms are explained with Weka and the R language. The learner can apply these algorithms easily using this well-known data mining tool and language. Let's first discuss the Weka tool.

Weka is open-source software released under the GNU General Public License. It was developed by the Machine Learning Group, University of Waikato, New Zealand. Although named after a flightless New Zealand bird, ‘WEKA’ stands for Waikato Environment for Knowledge Analysis. The system is written in the object-oriented language Java. Weka is data mining software: a set of machine learning algorithms that can be applied to a dataset directly or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

The story of the development of Weka is very interesting. It was initially developed by students of University of Waikato, New Zealand, as part of their course work on data mining. They had implemented all major machine learning algorithms as part of lab work for this course. In 1993, the University of Waikato began development of the original version of Weka, which became a mix of Tcl/Tk, C, and Makefiles. In 1997, the decision was made to redevelop Weka from scratch in Java, including implementations of modeling algorithms. In 2006, Pentaho Corporation acquired an exclusive license to use Weka for business intelligence.

This chapter will cover the installation of Weka, datasets available and will guide the learner about how to start experimentation using Weka. Later on we will discuss another data mining tool, R. Let us first discuss the installation process for Weka, step-by-step.

Installing Weka

Weka is freely available and its latest version can be easily downloaded from https://www.cs.waikato.ac.nz/ml/weka/downloading.html as shown in Figure 3.1.

For Weka to work smoothly, first download and install a Java VM before downloading Weka.

Written in lucid language, this valuable textbook brings together fundamental concepts of data mining and data warehousing in a single volume. Important topics including information theory, decision tree, Naïve Bayes classifier, distance metrics, partitioning clustering, association mining, data marts and operational data store are discussed comprehensively. The textbook is written to cater to the needs of undergraduate students of computer science, engineering and information technology for a course on data mining and data warehousing. The text simplifies the understanding of the concepts through exercises and practical examples. Chapters such as classification, association mining and cluster analysis are discussed in detail with their practical implementation using Weka and R language data mining tools. Advanced topics including big data analytics, relational data models and NoSQL are discussed in detail. Pedagogical features including unsolved problems and multiple-choice questions are interspersed throughout the book for better understanding.

Written by leading researchers, this complete introduction brings together all the theory and tools needed for building robust machine learning in adversarial environments. Discover how machine learning systems can adapt when an adversary actively poisons data to manipulate statistical inference, learn the latest practical techniques for investigating system security and performing robust data analysis, and gain insight into new approaches for designing effective countermeasures against the latest wave of cyber-attacks. Privacy-preserving mechanisms and the near-optimal evasion of classifiers are discussed in detail, and in-depth case studies on email spam and network security highlight successful attacks on traditional machine learning algorithms. Providing a thorough overview of the current state of the art in the field, and possible future directions, this groundbreaking work is essential reading for researchers, practitioners and students in computer security and machine learning, and those wanting to learn about the next stage of the cybersecurity arms race.

Summary

Machine learning algorithms provide the ability to quickly adapt and find patterns in large diverse data sources and therefore are a potential asset to application developers in enterprise systems, networks, and security domains. They make analyzing the security implications of these tools a critical task for machine learning researchers and practitioners alike, spawning a new subfield of research into adversarial learning for security-sensitive domains. The work presented in this book advanced the state of the art in this field of study with five primary contributions: a taxonomy for qualifying the security vulnerabilities of a learner, two novel practical attack/defense scenarios for learning in real-world settings, learning algorithms with theoretical guarantees on training-data privacy preservation, and a generalization of a theoretical paradigm for evading detection of a classifier. However, research in adversarial machine learning has only begun to address the field's complex obstacles—many challenges remain. These challenges suggest several new directions for research within both fields of machine learning and computer security. In this chapter we review our contributions and list a number of open problems in the area.

Above all, we investigated both the practical and theoretical aspects of applying machine learning in security domains. To understand potential threats, we analyzed the vulnerability of learning systems to adversarial malfeasance. We studied both attacks designed to optimally affect the learning system and attacks constrained by real-world limitations on the adversary's capabilities and information. We further designed defense strategies, which we showed significantly diminish the effect of these attacks. Our research focused on learning tasks in virus, spam, and network anomaly detection, but also is broadly applicable across many systems and security domains and has far-reaching implications to any system that incorporates learning. Here is a summary of the contributions of each component of this book followed by a discussion of open problems and future directions for research.

Framework for Secure Learning

The first contribution discussed in this book was a framework for assessing risks to a learner within a particular security context (see Table 3.1). The basis for this work is a taxonomy of the characteristics of potential attacks. From this taxonomy (summarized in Table 9.1), we developed security games between an attacker and defender tailored to the particular type of threat posed by the attacker.

Summary

Adversaries can also execute attacks designed to degrade the classifier's ability to distinguish between allowed and disallowed events. These Causative Availability attacks against learning algorithms cause the resulting classifiers to have unacceptably high false-positive rates; i.e., a successfully poisoned classifier will misclassify benign input as potential attacks, creating an unacceptable level of interruption in legitimate activity. This chapter provides a case study of one such attack on the SpamBayes spam detection system. We show that cleverly crafted attack messages (pernicious spam email that an uninformed human user would likely identify and label as spam) can exploit SpamBayes' learning algorithm, causing the learned classifier to have an unreasonably high false-positive rate. (Chapter 6 demonstrates Causative attacks that instead result in classifiers with an unreasonably high false-negative rate; these are Integrity attacks.) We also show effective defenses against these attacks and discuss the tradeoffs required to defend against them.

We examine several attacks against the SpamBayes spam filter, each of which embodies a particular insight into the vulnerability of the underlying learning technique. In doing so, we more broadly demonstrate attacks that could affect any system that uses a similar learning algorithm. The attacks we present target the learning algorithm used by the spam filter SpamBayes (spambayes.sourceforge.net), but several other filters also use the same underlying learning algorithm, including BogoFilter (bogofilter.sourceforge.net), the spam filter in Mozilla's Thunderbird email client (mozilla.org), and the machine learning component of SpamAssassin (spamassassin.apache.org). The primary difference between the learning elements of these three filters is in their tokenization methods; i.e., the learning algorithm is fundamentally identical, but each filter uses a different set of features. We demonstrate the vulnerability of the underlying algorithm for SpamBayes because it uses a pure machine learning method, it is familiar to the academic community (Meyer & Whateley 2004), and it is popular, with over 700,000 downloads. Although here we only analyze SpamBayes, the fact that these other systems use the same learning algorithm suggests that other filters are also vulnerable to similar attacks. However, the overall effectiveness of the attacks would depend on how each of the other filters incorporates the learned classifier into the final filtering decision.
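The mechanism behind such an attack can be illustrated with a toy token-based filter. The sketch below is a simplified stand-in and not SpamBayes' actual scoring method: the class, the smoothing, the 0.5 decision threshold, and all messages are hypothetical. It shows only how training on attack messages full of ordinary words, all labeled as spam, can push a legitimate message over the spam threshold.

```python
from collections import Counter


class ToyTokenFilter:
    """Toy filter: a message's score is the mean 'spamminess' of its tokens."""

    def __init__(self):
        self.spam_tokens, self.ham_tokens = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def train(self, message, is_spam):
        tokens = set(message.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def token_score(self, token):
        # Smoothed share of spam vs. ham messages that contain the token.
        s = (self.spam_tokens[token] + 1) / (self.n_spam + 2)
        h = (self.ham_tokens[token] + 1) / (self.n_ham + 2)
        return s / (s + h)

    def spam_score(self, message):
        tokens = set(message.lower().split())
        return sum(self.token_score(t) for t in tokens) / len(tokens)


filt = ToyTokenFilter()
for msg in ["meeting agenda for monday", "project report attached", "lunch on monday"]:
    filt.train(msg, is_spam=False)
for msg in ["cheap pills online", "win cheap prize now", "cheap prize online"]:
    filt.train(msg, is_spam=True)

legitimate = "monday project meeting"
score_before = filt.spam_score(legitimate)

# Attack: messages made of ordinary words, which the user labels as spam.
for _ in range(5):
    filt.train("monday project meeting agenda report lunch", is_spam=True)

score_after = filt.spam_score(legitimate)
print(f"before: {score_before:.3f}, after: {score_after:.3f}")
```

With a decision threshold of 0.5, the legitimate message is accepted before the attack and flagged as spam afterwards, which is the false-positive effect described above.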


15 - Big Data and NoSQL

10 - Implementing Association Mining with Weka and R

11 - Web Mining and Search Engines

Contents

Preface

7 - Cluster Analysis

6 - Implementing Classification in Weka and R

2 - Introduction to Data Mining

13 - Data Warehouse Schema

Acknowledgments

Colour Plates

1 - Beginning with Machine Learning

3 - Beginning with Weka and R Language

Data Mining and Data Warehousing

Adversarial Machine Learning

9 - Adversarial Machine Learning Challenges

8 - Principal component analysis in high dimensions

5 - Metric entropy and its uses

Illustrations

5 - Availability Attack Case Study: SpamBayes
