Density Based Clustering:
DBSCAN
Why DBSCAN over K-Means or Hierarchical clustering?
• In the first two columns of the figure, DBSCAN correctly identifies the
  circular and spiral clusters, whereas K-Means fails because it forces
  clusters into spherical shapes.
• In the third column, DBSCAN correctly identifies the dense clusters while
  marking noise points separately. K-Means, by contrast, incorrectly assigns
  the noise points to clusters.
• In the fourth column, the clusters are dense and well separated; both
  DBSCAN and K-Means perform well here.
• In the fifth column, the data are uniformly distributed. DBSCAN classifies
  everything as one cluster (there is no clear separation), while K-Means
  still splits the data into multiple clusters based on distance.
When to use DBSCAN Clustering
• Use DBSCAN when:
   • Clusters have irregular shapes (e.g., spirals, rings).
   • Noise or outliers are present.
   • Varying density is expected.
• Use K-Means when:
   • Clusters are well-separated and spherical.
   • The dataset is large and high-dimensional.
   • Noise is minimal.
Density-based spatial clustering of applications
with noise (DBSCAN)
• DBSCAN is a density-based clustering algorithm that groups data points
  based on density rather than distance.
• DBSCAN can effectively discover clusters of any arbitrary shape.
• Noise points, which do not meet the density criteria, are labeled as outliers.
• Clusters are usually in high-density regions and outliers tend to be in low-
  density regions.
• It requires minimal domain knowledge and can discover clusters of arbitrary
  shape.
• In the DBSCAN algorithm, only two key hyperparameters are required to
  identify clusters in a dataset: the first parameter, epsilon (ε), defines the
  radius of the neighborhood within which points are considered neighbors,
  and the second parameter, MinPts (minimum points), specifies the minimum
  number of data points required within this radius for a point to be classified
  as a core point.
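The roles of the two hyperparameters can be illustrated with a few lines of stdlib Python (the point values below are chosen only for illustration):

```python
import math

def is_core(points, i, eps, min_pts):
    """A point is core if at least min_pts points (itself included) lie within eps."""
    count = sum(1 for q in points if math.dist(points[i], q) <= eps)
    return count >= min_pts

pts = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(is_core(pts, 0, eps=1.5, min_pts=3))  # True: (0,0), (0,1), (1,0) lie within 1.5
print(is_core(pts, 3, eps=1.5, min_pts=3))  # False: (5,5) has no close neighbors
```

In scikit-learn's `DBSCAN`, these two hyperparameters correspond to the `eps` and `min_samples` constructor arguments.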
Terminologies
• 1. Epsilon (ε)
• The radius within which DBSCAN searches for neighboring points.
• A point is considered part of a cluster if it has enough neighboring
  points within ε.
• Choosing the right ε value is crucial:
    • Too small → Many small clusters
    • Too large → Merging of distinct clusters.
• 2. Minimum Points (MinPts)
   ➢The minimum number of points required within the ε-radius for a
    point to be considered a core point.
   ➢Typically set as MinPts ≥ D+1, where D is the dataset's dimensionality.
• 3. Core Point
   ➢A point that has at least MinPts points (including itself) within its ε-
    radius.
   ➢Forms the center of a cluster.
• 4. Border Point
   ➢A point that falls within the ε-radius of a core point but does not have
    enough neighbors to be a core point itself.
   ➢Unlike core points, non-core (border) points can only join a cluster
    and cannot be used to extend it further.
• DBSCAN operates on the idea of density-reachable and density-connected.
• 5. Density-reachable
    ➢A point q is density-reachable from a point p (with respect to Eps and
     MinPts) if there is a chain of points p₁, …, pₙ with p₁ = p and pₙ = q
     such that each pᵢ₊₁ is directly density-reachable from pᵢ (i.e., pᵢ₊₁
     lies within the Eps-neighborhood of the core point pᵢ).
    ➢Two border points of the same cluster C are possibly not density-
     reachable from each other, because the core-point condition might not
     hold for either of them.
    ➢However, there must be a core point in C from which both border points
     are density-reachable.
• 6. Density-connected
    ➢A point p is density-connected to a point q if there is a point o such
     that both p and q are density-reachable from o using Eps and MinPts.
    ➢Intuitively, a cluster is defined to be a set of density-connected
     points which is maximal with respect to density-reachability.
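The core / border / noise distinction above can be expressed directly in stdlib Python (a teaching sketch; the sample points are illustrative):

```python
import math

def classify(points, eps, min_pts):
    """Tag each point as 'core', 'border', or 'noise' per the definitions above."""
    # Eps-neighborhood of each point; the point itself is included in the count
    nbrs = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
            for p in points]
    kinds = []
    for i, n in enumerate(nbrs):
        if len(n) >= min_pts:
            kinds.append("core")                       # enough neighbors within eps
        elif any(len(nbrs[j]) >= min_pts for j in n):
            kinds.append("border")                     # within eps of some core point
        else:
            kinds.append("noise")                      # neither core nor border
    return kinds

pts = [(0, 0), (0, 1), (1, 0), (2, 0), (9, 9)]
print(classify(pts, eps=1.5, min_pts=3))
# ['core', 'core', 'core', 'border', 'noise']
```

Here (2, 0) is a border point: it has only two points within eps but lies within eps of the core point (1, 0), so it can join that cluster without extending it.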
Finding Clusters in a 2D Dataset
• We have the following dataset representing points in a 2D space:
• (1, 2), (2, 2), (2, 3), (8, 8), (8, 9), (25, 80), (9, 8)
• We set: ε (radius) = 2 and MinPts = 3
• Now, DBSCAN follows these steps:
1. Identify Core Points
• A point is a core point if at least MinPts = 3 points (including itself) are
  within the ε-radius.
• Consider (2,2): within ε = 2 we find (1,2), (2,2), (2,3) → 3 points in
  total, so (2,2) is a core point.
• Consider (8,8): within ε = 2 we find (8,8), (8,9), (9,8) → 3 points in
  total, so (8,8) is a core point.
2. Identify Border Points
• A border point is within the ε-radius of a core point but does not have
  enough neighbors to be a core point itself.
• In this dataset there are no border points: (1,2) and (2,3) each have 3
  points within ε = 2 (themselves plus the other two cluster members), so
  they are core points as well, and likewise (8,9) and (9,8).
3. Identify Noise Points
• Points that do not belong to any cluster (not a core or border point)
  are outliers (noise).
• Point (25,80) has no neighbors within ε = 2, so it is classified as noise
  or outlier.
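The neighbor counts used in the steps above can be checked with a few lines of stdlib Python:

```python
import math

pts = [(1, 2), (2, 2), (2, 3), (8, 8), (8, 9), (25, 80), (9, 8)]
eps, min_pts = 2, 3

def n_neighbors(p):
    """Number of dataset points within eps of p, counting p itself."""
    return sum(1 for q in pts if math.dist(p, q) <= eps)

print(n_neighbors((2, 2)))    # 3 -> core point
print(n_neighbors((8, 8)))    # 3 -> core point
print(n_neighbors((25, 80)))  # 1 -> noise
```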
How DBSCAN Works
• Set Parameters (ε and minPts): Choose an epsilon (ε) distance and a
  minimum points (minPts) count for density.
• Find Neighbor Points: For each point, identify neighboring points
  within distance ε.
• Form New Cluster: If a point has at least minPts neighbors, it forms
  the core of a new cluster.
• Expand Cluster: Add all density-reachable points to the cluster.
• Label Noise: Points not meeting density requirements are labeled as
  noise.
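The five steps above can be sketched as a compact, stdlib-only Python implementation (a teaching sketch, not an optimized version; in practice one would use e.g. scikit-learn's `DBSCAN`):

```python
import math

def region_query(points, i, eps):
    """Step 2: indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return a cluster label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:        # step 5: provisionally label as noise
            labels[i] = NOISE
            continue
        labels[i] = cluster                 # step 3: core point starts a new cluster
        seeds = list(neighbors)
        while seeds:                        # step 4: add all density-reachable points
            j = seeds.pop()
            if labels[j] == NOISE:          # noise reached from a core point -> border
                labels[j] = cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is itself core: its neighbors join too
                seeds.extend(j_neighbors)
        cluster += 1
    return labels

# The 2D dataset from the earlier example:
pts = [(1, 2), (2, 2), (2, 3), (8, 8), (8, 9), (25, 80), (9, 8)]
print(dbscan(pts, eps=2, min_pts=3))  # [0, 0, 0, 1, 1, -1, 1]
```

The output groups the first three points into one cluster, the three points around (8, 8) into a second, and labels (25, 80) as noise, matching the worked example.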
• Q. Given the points A(3, 7), B(4, 6), C(5, 5), D(6, 4), E(7, 3), F(6, 2),
  G(7, 2) and H(8, 4), Find the core points and outliers using DBSCAN.
  Take Eps = 2.5 and MinPts = 3.
• Solution:
• Given, Epsilon (Eps) = 2.5
• Minimum Points (MinPts) = 3
• Let’s represent the given data points in tabular form:

     Point:  A  B  C  D  E  F  G  H
     x:      3  4  5  6  7  6  7  8
     y:      7  6  5  4  3  2  2  4
• Step 1: To find the core points, outliers and clusters using DBSCAN, we
  first calculate the distances between all pairs of data points, using the
  Euclidean distance measure.
• The distance matrix (Euclidean, rounded to two decimal places) is as
  follows; distances ≤ Eps (i.e. 2.5) are marked with an asterisk:

          A      B      C      D      E      F      G      H
    A   0.00   1.41*  2.83   4.24   5.66   5.83   6.40   5.83
    B   1.41*  0.00   1.41*  2.83   4.24   4.47   5.00   4.47
    C   2.83   1.41*  0.00   1.41*  2.83   3.16   3.61   3.16
    D   4.24   2.83   1.41*  0.00   1.41*  2.00*  2.24*  2.00*
    E   5.66   4.24   2.83   1.41*  0.00   1.41*  1.00*  1.41*
    F   5.83   4.47   3.16   2.00*  1.41*  0.00   1.00*  2.83
    G   6.40   5.00   3.61   2.24*  1.41*  1.00*  0.00   2.24*
    H   5.83   4.47   3.16   2.00*  1.41*  2.83   2.24*  0.00
• Step 2: Now, finding all the data points that lie in the Eps-neighborhood
  of each data points. That is, put all the points in the neighborhood set of
  each data point whose distance is <=2.5.
   N(A) = {B}               → only B is within 2.5 of A
   N(B) = {A, C}            → A and C are within 2.5 of B
   N(C) = {B, D}            → B and D are within 2.5 of C
   N(D) = {C, E, F, G, H}   → C, E, F, G and H are within 2.5 of D
   N(E) = {D, F, G, H}      → D, F, G and H are within 2.5 of E
   N(F) = {D, E, G}         → D, E and G are within 2.5 of F
   N(G) = {D, E, F, H}      → D, E, F and H are within 2.5 of G
   N(H) = {D, E, G}         → D, E and G are within 2.5 of H
• Counting each point itself, data points B, C, D, E, F, G and H have at
  least MinPts = 3 points in their Eps-neighborhoods (e.g. |N(B) ∪ {B}| = 3)
  and hence are core points.
• Data point A is not a core point (|N(A) ∪ {A}| = 2 < 3), but it lies
  within Eps of the core point B, so it is a border point.
• There exist no outliers in the given set of data points.
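The neighborhood sets and core points above can be verified with a short stdlib-only snippet:

```python
import math

pts = {"A": (3, 7), "B": (4, 6), "C": (5, 5), "D": (6, 4),
       "E": (7, 3), "F": (6, 2), "G": (7, 2), "H": (8, 4)}
eps, min_pts = 2.5, 3

# Eps-neighborhood of each point, excluding the point itself (as in the solution)
N = {a: sorted(b for b in pts if b != a and math.dist(pts[a], pts[b]) <= eps)
     for a in pts}

# Core points: at least min_pts points (counting the point itself) within eps
core = sorted(a for a in pts if len(N[a]) + 1 >= min_pts)
print(N["D"])   # ['C', 'E', 'F', 'G', 'H']
print(core)     # ['B', 'C', 'D', 'E', 'F', 'G', 'H']
```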
Advantages of DBSCAN
• Arbitrary Cluster Shapes: Handles clusters of varying shapes and
  densities.
• No Pre-Specified K: Automatically determines the number of clusters.
• Robust to Outliers: Noise points are left out of clusters, reducing
  skew.
Disadvantages of DBSCAN
• Sensitive to Parameters: Results depend on careful tuning of ε and
  minPts.
• Challenges with High Dimensions: Suffers from the curse of
  dimensionality.
• Difficulty with Density Variation: Clusters with different densities can
  be hard to capture.
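Regarding parameter sensitivity: a common heuristic for choosing ε is the k-distance plot, where each point's distance to its k-th nearest neighbor is sorted and an "elbow" in the curve suggests a good ε. A stdlib-only sketch (the sample points are illustrative):

```python
import math

def k_distances(points, k):
    """Sorted (descending) distance from each point to its k-th nearest neighbor.
    Plotting these values and looking for an 'elbow' is a common way to pick eps."""
    ds = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        ds.append(d[k - 1])
    return sorted(ds, reverse=True)

pts = [(1, 2), (2, 2), (2, 3), (8, 8), (8, 9), (9, 8), (25, 80)]
print(k_distances(pts, k=3))  # the largest value belongs to the outlier (25, 80)
```

A common convention is to set k to the intended MinPts; outliers show up as a few very large values at the head of the curve.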
Comparison of K-Means, Hierarchical, & DBSCAN