The representative-based clustering methods like K-means and expectation maximization are suitable for finding ellipsoid-shaped clusters, or at best convex clusters. However, for nonconvex clusters, such as those shown in Figure 15.1, these methods have trouble finding the true clusters, as two points from different clusters may be closer than two points in the same cluster. The density-based methods we consider in this chapter are able to mine such nonconvex clusters.
THE DBSCAN ALGORITHM
Density-based clustering uses the local density of points to determine the clusters, rather than using only the distance between points. We define a ball of radius ε around a point x ∈ Rd, called the ε-neighborhood of x, as follows:
Nε(x) = Bd(x, ε) = {y | δ(x,y) ≤ ε}
Here δ(x,y) represents the distance between points x and y, which is usually assumed to be the Euclidean distance, that is, δ(x, y) = ||x − y||2. However, other distance metrics can also be used.
For any point x ∈ D, we say that x is a core point if there are at least minpts points in its ε-neighborhood. In other words, x is a core point if |Nε(x)| ≥ minpts, where minpts is a user-defined local density or frequency threshold.