Chapter Objectives
• To define clustering and explain its applications and features.
• To explain various proximity measures for data clustering.
• To discuss various clustering techniques.
• To explain the working principle of the k-means clustering algorithm.
• To discuss hierarchical clustering and its types.
• To discuss agglomerative and divisive clustering techniques.
• To describe the concept of the DBSCAN algorithm.
12.1 Introduction to Clustering
In machine learning (ML), labeling data is a crucial task, but labeled data is not always available. Unlabeled data can still be analyzed using clustering techniques. ML algorithms are broadly classified as supervised or unsupervised. In supervised learning, the input data points (examples) are labeled, whereas in unsupervised learning they are not. Cluster analysis, also called clustering, is an unsupervised learning technique and one of the most widely used.
Clustering is defined as grouping the input data points into various clusters/groups based on their similarity.
A cluster contains objects that are more similar to each other than to objects in other clusters. In other words, cluster analysis groups the data into classes or clusters so that records within a cluster (intra-cluster) are highly similar to one another but highly dissimilar to records in other clusters (inter-cluster).
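A clustering algorithm aims to minimize the intra-cluster distance (the distance between points in the same cluster) and maximize the inter-cluster distance (the distance between points in different clusters), as shown in Figure 12.1.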
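To make the two quantities concrete, the following minimal sketch compares a compact grouping with a mixed one. The function names, the toy points, and the choice of averaging Euclidean distance over all pairs are assumptions made for illustration; other definitions of intra-cluster and inter-cluster distance (for example, distances to or between cluster centers) are also in common use.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def mean_intra_cluster_distance(clusters):
    """Average distance over all pairs of points that share a cluster."""
    pair_distances = [
        dist(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    ]
    return sum(pair_distances) / len(pair_distances)


def mean_inter_cluster_distance(clusters):
    """Average distance over all pairs of points taken from different clusters."""
    pair_distances = [
        dist(p, q)
        for points_a, points_b in combinations(clusters.values(), 2)
        for p in points_a
        for q in points_b
    ]
    return sum(pair_distances) / len(pair_distances)


# Hypothetical 2-D points: the same six points grouped in two different ways.
good_grouping = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
}
poor_grouping = {
    "A": [(1.0, 1.0), (8.0, 8.0), (2.0, 1.0)],
    "B": [(1.0, 2.0), (8.0, 9.0), (9.0, 8.0)],
}

for name, grouping in [("good", good_grouping), ("poor", poor_grouping)]:
    intra = mean_intra_cluster_distance(grouping)
    inter = mean_inter_cluster_distance(grouping)
    print(f"{name}: intra = {intra:.2f}, inter = {inter:.2f}")
```

Under these assumptions, the grouping that keeps nearby points together yields a smaller intra-cluster distance and a larger inter-cluster distance than the mixed grouping, which is the outcome a clustering algorithm seeks.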
An example of clustering is shown in Figure 12.2. The input records have different shapes, but no shape labels are provided. The clustering algorithm takes the dimensions and color of each object as input features, and records whose features are highly similar are gathered into a single cluster. After the algorithm is applied, the records are grouped into three clusters: rhombuses, circles, and triangles, as shown in Figure 12.2.
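The idea of grouping unlabeled records by feature similarity can be sketched in a few lines. The text does not say which algorithm produced Figure 12.2, so this sketch uses plain k-means, which is introduced later in the chapter, as a stand-in. The feature values (a normalized size and a normalized color value), the fixed starting centroids, and the function name `k_means` are all hypothetical. The shape names are kept in a separate list only so the result can be inspected; the algorithm never sees them.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def k_means(points, initial_centroids, max_iterations=100):
    """Plain k-means sketch: returns one cluster index per input point."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # No point changed cluster, so the grouping is stable.
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # Keep the old centroid if a cluster ends up empty.
                centroids[k] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignments


# Hypothetical features per record: (normalized size, normalized color value).
records = [
    (0.20, 0.10), (0.80, 0.20), (0.50, 0.90),
    (0.25, 0.15), (0.85, 0.25), (0.55, 0.85),
    (0.22, 0.12), (0.78, 0.18), (0.48, 0.92),
]
# Hidden shapes, used only to inspect the result; not passed to k_means.
hidden_shapes = ["rhombus", "circle", "triangle"] * 3

# Assumption: start from the first three records to keep the run deterministic.
cluster_ids = k_means(records, initial_centroids=records[:3])

for shape, cluster_id in zip(hidden_shapes, cluster_ids):
    print(f"{shape:8s} -> cluster {cluster_id}")
```

With these well-separated toy features, each cluster ends up holding records of a single shape, even though no shape label was provided. On real data the outcome depends on the chosen features, their scaling, and the starting centroids.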