Regression analysis has been a major theoretical pillar for supervised machine learning since it is applicable to a broad range of identification, prediction and classification problems. There are two major approaches to the design of robust regressors. The first category involves a variety of regularization techniques whose principle lies in incorporating both the error and the penalty terms into the cost function. It is represented by the ridge regressor. The second category is based on the premise that the robustness of the regressor could be enhanced by accounting for potential measurement errors in the learning phase. These techniques are known as errors-in-variables models in statistics and are relatively new to the machine learning community. In our discussion, such errors in variables are viewed as additive input perturbation.
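The principle behind the first category, a cost function that sums a squared-error term and a penalty term, can be illustrated with the ridge regressor's closed-form solution. The sketch below is a minimal NumPy illustration; the toy data and the regularization weight `lam` are made-up assumptions, not an example from the text:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Minimise ||Xw - y||^2 + lam * ||w||^2, which has the
    # closed form w = (X^T X + lam*I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Noisy samples of y = 2x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.1, 3.9, 6.0])
w = ridge_fit(X, y, lam=0.1)
```

Increasing `lam` shrinks the solution toward zero, which is exactly the trade between fitting the error term and paying the penalty term.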
This chapter aims at enhancing the robustness of estimators by incorporating input perturbation into the conventional regression analysis. It develops a kernel perturbation-regulated regressor (PRR) that is based on the errors-in-variables models. The PRR offers a strong smoothing capability that is critical to the robustness of regression or classification results. For Gaussian cases, the notion of orthogonal polynomials is instrumental to optimal estimation and its error analysis. More exactly, the regressor may be expressed as a linear combination of many simple Hermite regressors, each focusing on one (and only one) orthogonal polynomial.
This chapter will cover the fundamental theory of linear regression and regularization analysis. The analysis leads to a closed-form error formula that is critical for order-error tradeoff.
In Chapter 8, it is shown that the kernel ridge regressor (KRR) offers a unified treatment for over-determined and under-determined systems. Another way of achieving unification of these two linear systems approaches is by means of the support vector machine (SVM) learning model proposed by Vapnik [41, 280, 281].
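One way to see the unification is through the standard dual form of kernel ridge regression, where the same closed-form expression applies regardless of whether the underlying linear system is over- or under-determined. The following is a hedged sketch of that standard formula; the linear kernel and toy data are illustrative assumptions, not the chapter's example:

```python
import numpy as np

def krr_fit(K, y, lam):
    # Dual closed form of kernel ridge regression:
    # alpha = (K + lam*I)^{-1} y.  The same formula applies
    # whether the underlying system is over- or under-determined.
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

# Toy 1-D data with a linear kernel k(x, x') = x * x'
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.0, 6.0])          # y = 2x exactly
K = np.outer(x_train, x_train)
alpha = krr_fit(K, y_train, lam=0.01)

# Predict at x* = 4: f(x*) = sum_i alpha_i * k(x*, x_i)
pred = (4.0 * x_train) @ alpha
```

With a small `lam` the prediction at x* = 4 lands close to the noiseless target 8.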
Just like FDA, SVM aims at the separation of two classes. FDA focuses on separating the positive and negative centroids, taking the total data distribution into account. In contrast, SVM aims at the separation of only the so-called support vectors, i.e. only those training vectors deemed critical for class separation.
Just like ridge regression, the objective of the SVM classifier also involves minimization of the two-norm of the decision vector.
The key component in SVM learning is to identify a set of representative training vectors deemed to be most useful for shaping the (linear or nonlinear) decision boundary. These training vectors are called “support vectors.” The rest of the training vectors are called non-support vectors. Note that only support vectors can directly take part in the characterization of the decision boundary of the SVM.
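To make the notion concrete, here is a minimal linear soft-margin SVM trained by sub-gradient descent on the hinge loss. This is a toy NumPy sketch with made-up data (a library SVM would normally be used); after training, the points whose margins sit at or below 1 are the support vectors, while the rest lie safely beyond the margin and do not shape the boundary:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=2000):
    # Sub-gradient descent on  lam*||w||^2 + mean(hinge loss),
    # with labels y in {-1, +1}
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                    # margin violators
        gw = 2 * lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        gb = -y[viol].sum() / n
        w, b = w - lr * gw, b - lr * gb
    return w, b

X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [3., 3.], [3., 4.], [4., 3.]])
y = np.array([-1, -1, -1, 1, 1, 1])
w, b = train_linear_svm(X, y)
# Training points with y*(w.x + b) near 1 hug the margin and act
# as support vectors; points with larger margins do not.
```

Only the inner points of each cluster end up with margins near 1; perturbing the outer points would leave the learned boundary essentially unchanged.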
SVM has successfully been applied to an enormously broad spectrum of application domains, including signal processing and classification, image retrieval, multimedia, fault detection, communication, computer vision, security/authentication, time-series prediction, biomedical prediction, and bioinformatics.
Two primary techniques for dimension-reducing feature extraction are subspace projection and feature selection. This chapter will explore the key subspace projection approaches, i.e. PCA and KPCA.
(i) Section 3.2 provides motivations for dimension reduction by pointing out (1) the potential adverse effect of large feature dimensions and (2) the potential advantage of focusing on a good set of highly selective representations.
(ii) Section 3.3 introduces subspace projection approaches to feature-dimension reduction. It shows that the well-known PCA offers the optimal solution under two information-preserving criteria: least-squares error and maximum entropy.
(iii) Section 3.4 discusses several numerical methods commonly adopted for computation of PCA, including singular value decomposition (on the data matrix), spectral decomposition (on the scatter matrix), and spectral decomposition (on the kernel matrix).
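The first two of these routes can be checked against each other numerically: the eigenvalues of the scatter matrix are the squared singular values of the centred data matrix, and the leading eigenvectors coincide up to sign. A short NumPy sketch, with random data standing in for a real data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Xc = X - X.mean(axis=0)            # centre the data

# Route 1: SVD of the (centred) data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Route 2: spectral decomposition of the scatter matrix S = Xc^T Xc
S = Xc.T @ Xc
evals, evecs = np.linalg.eigh(S)   # returned in ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]

# evals should equal s**2, and the leading columns of evecs should
# match the rows of Vt up to sign.
```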
(iv) Section 3.5 shows that spectral factorization of the kernel matrix leads to both kernel-based spectral space and kernel PCA (KPCA) [238]. In fact, KPCA is synonymous with the kernel-induced spectral feature vector. We shall show that nonlinear KPCA offers an enhanced capability in handling complex data analysis. By use of examples, it will be demonstrated that nonlinear kernels offer greater visualization flexibility in unsupervised learning and higher discriminating power in supervised learning.
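As a concrete illustration of the computation in (iv) — centre the kernel matrix, then take its spectral factorization — here is a short sketch. The RBF kernel, the value of `gamma`, and the toy clusters are illustrative assumptions, not the book's formulation:

```python
import numpy as np

def kpca(X, new_dim, gamma=1.0):
    # Kernel PCA: spectral decomposition of the centred kernel matrix
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                  # RBF kernel matrix
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                           # centring in feature space
    evals, evecs = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:new_dim]
    # Project the training points onto the leading kernel components
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# Two tight, well-separated clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
Z = kpca(X, new_dim=2)
```

On data like this the first kernel component separates the two clusters by sign, which is the kind of visualization flexibility the section refers to.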
Why dimension reduction?
In many real-world applications, the feature dimension (i.e. the number of features or attributes in an input vector) could easily be as high as tens of thousands. Such an extreme dimensionality could be very detrimental to data analysis and processing.
This part contains two chapters concerning reduction of the dimension of the feature space, which plays a vital role in improving learning efficiency as well as prediction performance.
Chapter 3 covers the most prominent subspace projection approach, namely the classical principal component analysis (PCA), cf. Algorithm 3.1. Theorems 3.1 and 3.2 establish the optimality of PCA for both the minimum reconstruction error and maximum entropy criteria. The optimal error and entropy attainable by PCA are given in closed form. Algorithms 3.2, 3.3, and 3.4 describe the numerical procedures for the computation of PCA via the data matrix, scatter matrix, and kernel matrix, respectively.
Given a finite training dataset, the PCA learning model meets the LSP condition, and thus the conventional PCA model can be kernelized. When a nonlinear kernel is adopted, it further extends to the kernel-PCA (KPCA) learning model. The KPCA algorithms can be presented in intrinsic space or empirical space (see Algorithms 3.5 and 3.6). For several real-life datasets, visualization via KPCA shows more visible data separability than that via PCA. Moreover, KPCA is closely related to the kernel-induced spectral space, which proves instrumental for error analysis in unsupervised and supervised applications.
Chapter 4 explores various aspects of feature selection methods for supervised and unsupervised learning scenarios. It presents several filtering-based and wrapper-based methods for feature selection, a popular method for dimension reduction.
A series of important applications of combinatorics on words has emerged with the development of computerized text and string processing. The aim of this volume, the third in a trilogy, is to present a unified treatment of some of the major fields of applications. After an introduction that sets the scene and gathers together the basic facts, there follow chapters in which applications are considered in detail. The areas covered include core algorithms for text processing, natural language processing, speech processing, bioinformatics, and areas of applied mathematics such as combinatorial enumeration and fractal analysis. No special prerequisites are needed, and no familiarity with the application areas or with the material covered by the previous volumes is required. The breadth of application, combined with the inclusion of problems and algorithms and a complete bibliography will make this book ideal for graduate students and professionals in mathematics, computer science, biology and linguistics.
This is the first comprehensive introduction to Support Vector Machines (SVMs), a new generation of learning systems based on recent advances in statistical learning theory. SVMs deliver state-of-the-art performance in real-world applications such as text categorisation, hand-written character recognition, image classification, and biosequence analysis, and are now established as one of the standard tools for machine learning and data mining. Students will find the book both stimulating and accessible, while practitioners will be guided smoothly through the material required for a good grasp of the theory and its applications. The concepts are introduced gradually in accessible and self-contained stages, while the presentation is rigorous and thorough. Pointers to relevant literature and web sites containing software ensure that it forms an ideal starting point for further study. Equally, the book and its associated web site will guide practitioners to updated literature, new applications, and on-line software.
As one of the most comprehensive machine learning texts around, this book does justice to the field's incredible richness, but without losing sight of the unifying principles. Peter Flach's clear, example-based approach begins by discussing how a spam filter works, which gives an immediate introduction to machine learning in action, with a minimum of technical fuss. Flach provides case studies of increasing complexity and variety with well-chosen examples and illustrations throughout. He covers a wide range of logical, geometric and statistical models and state-of-the-art topics such as matrix factorisation and ROC analysis. Particular attention is paid to the central role played by features. The use of established terminology is balanced with the introduction of new and useful concepts, and summaries of relevant background material are provided with pointers for revision if necessary. These features ensure Machine Learning will set a new standard as an introductory textbook.
AND SO WE HAVE come to the end of our journey through the ‘making sense of data’ landscape. We have seen how machine learning can build models from features for solving tasks involving data. We have seen how models can be predictive or descriptive; learning can be supervised or unsupervised; and models can be logical, geometric, probabilistic or ensembles of such models. Now that I have equipped you with the basic concepts to understand the literature, there is a whole world out there for you to explore. So it is only natural for me to leave you with a few pointers to areas you may want to learn about next.
One thing that we have often assumed in the book is that the data comes in a form suitable for the task at hand. For example, if the task is to label e-mails we conveniently learn a classifier from data in the form of labelled e-mails. For tasks such as class probability estimation I introduced the output space (for the model) as separate from the label space (for the data) because the model outputs (class probability estimates) are not directly observable in the data and have to be reconstructed. An area where the distinction between data and model output is much more pronounced is reinforcement learning. Imagine you want to learn how to be a good chess player. This could be viewed as a classification task, but then you require a teacher to score every move.
TWO HEADS ARE BETTER THAN ONE – a well-known proverb suggesting that two minds working together can often achieve better results. If we read ‘features’ for ‘heads’ then this is certainly true in machine learning, as we have seen in the preceding chapters. But we can often further improve things by combining not just features but whole models, as will be demonstrated in this chapter. Combinations of models are generally known as model ensembles. They are among the most powerful techniques in machine learning, often outperforming other methods. This comes at the cost of increased algorithmic and model complexity.
The topic of model combination has a rich and diverse history, to which we can only partly do justice in this short chapter. The main motivations came from computational learning theory on the one hand, and statistics on the other. It is a well-known statistical intuition that averaging measurements can lead to a more stable and reliable estimate because we reduce the influence of random fluctuations in single measurements. So if we were to build an ensemble of slightly different models from the same training data, we might be able to similarly reduce the influence of random fluctuations in single models. The key question here is how to achieve diversity between these different models. As we shall see, this can often be achieved by training models on random subsets of the data, and even by constructing them from random subsets of the available features.
TREE MODELS ARE among the most popular models in machine learning. For example, the pose recognition algorithm in the Kinect motion sensing device for the Xbox game console has decision tree classifiers at its heart (in fact, an ensemble of decision trees called a random forest about which you will learn more in Chapter 11). Trees are expressive and easy to understand, and of particular appeal to computer scientists due to their recursive ‘divide-and-conquer’ nature.
In fact, the paths through the logical hypothesis space discussed in the previous chapter already constitute a very simple kind of tree. For instance, the feature tree in Figure 5.1 (left) is equivalent to the path in Figure 4.6 (left) on p.117. This equivalence is best seen by tracing the path and the tree from the bottom upward.
The left-most leaf of the feature tree represents the concept at the bottom of the path, covering a single positive example.
The next concept up in the path generalises the literal Length = 3 into Length = [3,5] by means of internal disjunction; the added coverage (one positive example) is represented by the second leaf from the left in the feature tree.
By dropping the condition Teeth = few we add another two covered positives.
Dropping the ‘Length’ condition altogether (or extending the internal disjunction with the one remaining value ‘4’) adds the last positive, and also a negative.
THE PREVIOUS CHAPTER introduced binary classification and associated tasks such as ranking and class probability estimation. In this chapter we will go beyond these basic tasks in a number of ways. Section 3.1 discusses how to handle more than two classes. In Section 3.2 we consider the case of a real-valued target variable. Section 3.3 is devoted to various forms of learning that are either unsupervised or aimed at learning descriptive models.
Handling more than two classes
Certain concepts are fundamentally binary. For instance, the notion of a coverage curve does not easily generalise to more than two classes. We will now consider general issues related to having more than two classes in classification, scoring and class probability estimation. The discussion will address two issues: how to evaluate multi-class performance, and how to build multi-class models out of binary models. The latter is necessary for some models, such as linear classifiers, that are primarily designed to separate two classes. Other models, including decision trees, handle any number of classes quite naturally.
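The one-versus-rest construction is a common way to build multi-class models out of binary ones: train one binary scorer per class (that class against the rest) and predict the class whose scorer is most confident. In the sketch below a deliberately naive centroid scorer stands in for a real binary classifier; all names and data are illustrative:

```python
import numpy as np

def train_one_vs_rest(X, y, train_binary):
    # One binary model per class: class c versus the rest
    return {c: train_binary(X, (y == c).astype(int)) for c in np.unique(y)}

def predict_one_vs_rest(models, score, x):
    # Predict the class whose binary model scores x highest
    return max(models, key=lambda c: score(models[c], x))

# Stand-in binary "learner": remember the positive-class centroid,
# and score a point by its negative distance to that centroid.
train_binary = lambda X, y01: X[y01 == 1].mean(axis=0)
score = lambda centroid, x: -np.linalg.norm(x - centroid)

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [9., 0.], [10., 0.]])
y = np.array([0, 0, 1, 1, 2, 2])
models = train_one_vs_rest(X, y, train_binary)
```

Swapping in a genuine binary learner (a linear classifier, say) only changes `train_binary` and `score`; the reduction itself stays the same.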
Multi-class classification
Classification tasks with more than two classes are very common. For instance, once a patient has been diagnosed as suffering from a rheumatic disease, the doctor will want to classify him or her further into one of several variants. If we have k classes, performance of a classifier can be assessed using a k-by-k contingency table. Assessing performance is easy if we are interested in the classifier's accuracy, which is still the sum of the descending diagonal of the contingency table, divided by the number of test instances.
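The accuracy computation generalises directly from the two-class case; a small sketch with a made-up 3-by-3 contingency table (rows index the true class, columns the predicted class):

```python
import numpy as np

def accuracy(C):
    # Sum of the descending diagonal of the contingency table,
    # divided by the number of test instances (the sum of all cells)
    C = np.asarray(C)
    return np.trace(C) / C.sum()

C = [[15,  2,  3],   # true class a
     [ 7, 15,  8],   # true class b
     [ 2,  3, 45]]   # true class c
```

Here 15 + 15 + 45 = 75 of the 100 test instances are classified correctly, giving an accuracy of 0.75.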
IN THIS CHAPTER and the next we take a bird's-eye view of the wide range of different tasks that can be solved with machine learning techniques. ‘Task’ here refers to whatever it is that machine learning is intended to improve performance of (recall the definition of machine learning on p.3), for example, e-mail spam recognition. Since this is a classification task, we need to learn an appropriate classifier from training data. Many different types of classifiers exist: linear classifiers, Bayesian classifiers, distance-based classifiers, to name a few. We will refer to these different types as models; they are the subject of Chapters 4–9. Classification is just one of a range of possible tasks for which we can learn a model: other tasks that will pass the review in this chapter are class probability estimation and ranking. In the next chapter we will discuss regression, clustering and descriptive modelling. For each of these tasks we will discuss what it is, what variants exist, how performance at the task could be assessed, and how it relates to other tasks. We will start with some general notation that is used in this chapter and throughout the book (see Background 2.1 for the relevant mathematical concepts).
The objects of interest in machine learning are usually referred to as instances. The set of all possible instances is called the instance space, denoted 𝒳 in this book.
MACHINE LEARNING IS a practical subject as much as a computational one. While we may be able to prove that a particular learning algorithm converges to the theoretically optimal model under certain assumptions, we need actual data to investigate, e.g., the extent to which those assumptions are actually satisfied in the domain under consideration, or whether convergence happens quickly enough to be of practical use. We thus evaluate or run particular models or learning algorithms on one or more data sets, obtain a number of measurements and use these to answer particular questions we might be interested in. This broadly characterises what is known as machine learning experiments.
In the natural sciences, an experiment can be seen as a question to nature about a scientific theory. For example, Arthur Eddington's famous 1919 experiment to verify Einstein's theory of general relativity asked the question: Are rays of light bent by gravitational fields produced by large celestial objects such as the Sun? To answer this question, the perceived position of stars was recorded under several conditions including a total solar eclipse. Eddington was able to show that these measurements indeed differed to an extent unexplained by Newtonian physics but consistent with general relativity.
While you don't have to travel to the island of Príncipe to perform machine learning experiments, they bear some similarity to experiments in physics in that machine learning experiments pose questions about models that we try to answer by means of measurements on data.