Understanding HDBSCAN: A Deep Dive into Hierarchical Density-Based Clustering

Introduction

Data, as we know it, is the new oil of the digital era. With the enormous surge in data production, methods to extract meaningful insights from this data are more essential than ever. One such essential data analysis technique is clustering, a type of unsupervised machine learning. Clustering plays a crucial role in helping us understand the inherent structure and patterns in data by grouping similar data points together, thereby unearthing the hidden relationships between them.

Clustering algorithms find extensive usage across a broad spectrum of fields. From segmenting customers for targeted outreach in marketing to discovering gene families in bioinformatics, clustering plays a pivotal role. However, not all clustering algorithms are created equal. Some struggle with high-dimensional data, others are sensitive to the initial parameters set, and many can’t efficiently handle large, diverse datasets. Among the various clustering algorithms out there, one that stands out for its robustness and versatility is HDBSCAN – Hierarchical Density-Based Spatial Clustering of Applications with Noise.

In this blog post, we’ll do a deep dive into the HDBSCAN algorithm. We will explore its underlying principles, understand its implementation, identify its strengths and weaknesses, and see how it surpasses other traditional clustering algorithms in handling complex, real-world data. Whether you’re a seasoned data scientist or brand new to the field, you’ll find HDBSCAN a great algorithm to have in your toolbox.

Overview of HDBSCAN

At its core, density-based clustering operates on the premise that clusters are dense regions in the data space separated by regions that are relatively sparse. The algorithm essentially seeks areas in the dataset where there are lots of data points (high density), and separates these regions from areas with few data points (low density).

To better understand this, imagine a room full of people, where groups of people are standing close together having conversations, with empty spaces between these groups. In this scenario, each group of people can be considered a cluster. Density-based clustering would be able to identify these groups, as they are areas of high density, separated by areas of low density.

One significant advantage of density-based clustering methods is their ability to discover clusters of arbitrary shapes, which is not possible with other clustering algorithms like k-means. Furthermore, density-based clustering algorithms have the capability to identify noise or outliers, as these are typically far from high density areas.

The core idea of density-based clustering is well illustrated by algorithms such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), where the density around each data point is calculated, and regions of sufficiently high density are determined to be clusters. However, a limitation of DBSCAN is its inability to find clusters of varying densities.

This is where HDBSCAN shines. HDBSCAN stands out from other clustering algorithms due to its capability of discerning clusters of different densities and its ability to handle noise in the data. The algorithm first constructs a hierarchy of clusters and then prunes this hierarchy to find the most stable clusters, hence enabling the detection of clusters at different scales.

At a high level, HDBSCAN begins by transforming the space according to the density of the data points. It then constructs a minimum spanning tree of the data, which it uses to build a hierarchy of clusters. These clusters are then pruned based on their stability, a measure derived from their density and lifespan in the hierarchy. Ultimately, the algorithm outputs a set of clusters that best match the underlying structure of the data, disregarding points it classifies as noise.

By creating a hierarchy of clusters and then pruning this hierarchy based on the stability of each cluster, HDBSCAN overcomes the limitations of DBSCAN, and is able to find clusters of varying densities. This makes HDBSCAN a powerful and flexible tool for clustering tasks, especially when dealing with complex, high-dimensional datasets. In the section below, we’ll dive deeper into the technical details of the algorithm. 

Inner-Workings of HDBSCAN

In our quest to better understand HDBSCAN, it is essential to peer under its hood and familiarize ourselves with the underlying mechanics of the algorithm. HDBSCAN, an acronym for Hierarchical Density-Based Spatial Clustering of Applications with Noise, has emerged as a highly effective and flexible tool for clustering tasks, particularly with complex and high-dimensional data.

HDBSCAN is a significant step beyond earlier density-based clustering algorithms like DBSCAN, primarily due to its ability to identify clusters of varying densities and its hierarchical approach to clustering. This approach allows the algorithm to determine the optimal number of clusters, a much-coveted feature not often seen in other clustering algorithms. Let’s dive into the steps that HDBSCAN takes to form the clusters. 

1. Transforming the Space

The first step in HDBSCAN is to transform the feature space using a metric known as the “mutual reachability distance.” Unlike simple Euclidean or Manhattan distances, this metric incorporates density into the distance calculation between points. For any two data points a and b, the mutual reachability distance is defined as:

d_mreach-k(a, b) = max{ coreDistance_k(a), coreDistance_k(b), d(a, b) }

In the equation above, d(a, b) is just the distance between points a and b, while coreDistance_k is defined as the distance from a point to its kth nearest neighbor. The value of k is usually left up to the user to choose, and effectively incorporates local density into the calculation.

Transforming the space with mutual reachability distances helps to emphasize the density-based relationships between points, making the algorithm sensitive to local density variations. This step sets the stage for the rest of the HDBSCAN algorithm.
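To make the metric concrete, here is a small sketch of how mutual reachability distances could be computed with NumPy and scikit-learn. Note that the `mutual_reachability` helper below is hypothetical, not part of the hdbscan API, and real implementations avoid materializing the full pairwise distance matrix.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_reachability(X, k=2):
    """Pairwise mutual reachability distances:
    d_mreach(a, b) = max(coreDistance(a), coreDistance(b), d(a, b))."""
    # Plain pairwise Euclidean distances
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Core distance: distance to the k-th nearest neighbor
    # (request k + 1 neighbors because each point's nearest neighbor is itself)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    core = nn.kneighbors(X)[0][:, -1]
    return np.maximum(np.maximum(core[:, None], core[None, :]), dist)

# Three tightly packed points and one distant outlier
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
M = mutual_reachability(X, k=2)
```

Because the outlier's core distance is large, its mutual reachability distance to every other point is inflated, which is exactly how the transformed space encodes local density.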

2. Constructing the Minimum Spanning Tree (MST)

Once the mutual reachability distances between all pairs of points are calculated, the next step is to construct a Minimum Spanning Tree (MST) based on these distances. An MST is a tree that connects all the points in such a way that the sum of the edge weights (in this case, mutual reachability distances) is minimized.

Algorithms like Prim’s can be used to efficiently construct the MST. The edges of the MST represent the relationships based on mutual reachability, connecting points in a way that highlights the underlying density structure.
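As a sketch, SciPy’s minimum_spanning_tree can build the MST from a precomputed distance matrix. The toy matrix below stands in for mutual reachability distances and is made up for illustration; production implementations such as the hdbscan library use faster dual-tree algorithms rather than a dense matrix.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Toy symmetric matrix of mutual reachability distances for 4 points
D = np.array([
    [0.0, 1.0, 4.0, 6.0],
    [1.0, 0.0, 2.0, 5.0],
    [4.0, 2.0, 0.0, 3.0],
    [6.0, 5.0, 3.0, 0.0],
])

# The MST keeps the n - 1 cheapest edges that still connect every point
mst = minimum_spanning_tree(D)
```

Here the tree keeps edges 0-1, 1-2, and 2-3 with total weight 6, discarding the heavier direct connections between distant points.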

3. Building the Hierarchy

HDBSCAN constructs a hierarchical tree from the MST, a structure known as the “cluster hierarchy.” In this process, each leaf node of the tree corresponds to a single data point, and as we traverse upwards, nodes merge based on edge weight—i.e., mutual reachability distance—in the MST.

Starting from the leaf nodes, the algorithm iteratively merges nodes or clusters based on the smallest edge weight, moving upwards to form a complete binary tree known as a dendrogram.
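This merging procedure is equivalent to single-linkage clustering over the mutual reachability distances, which can be sketched with SciPy (reusing a made-up 4-point distance matrix for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Toy symmetric mutual reachability distance matrix for 4 points
D = np.array([
    [0.0, 1.0, 4.0, 6.0],
    [1.0, 0.0, 2.0, 5.0],
    [4.0, 2.0, 0.0, 3.0],
    [6.0, 5.0, 3.0, 0.0],
])

# Single linkage merges the closest clusters first, mirroring how the
# dendrogram is grown from the smallest MST edge weights upwards
Z = linkage(squareform(D), method='single')
```

Each row of Z records one merge: the two clusters joined, the distance at which they joined, and the size of the resulting cluster.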

4. Condensing the Tree

After building the cluster hierarchy, the next step is to condense this tree. This process relies on the notion of “stability” of clusters over varying density thresholds. Clusters that are less stable over these thresholds are pruned. The result is a “condensed cluster tree,” which is a simplified version of the initial hierarchical tree. It retains only those clusters that have shown enough stability over the various density levels.
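As a rough numeric illustration of stability (assuming the standard definition from the HDBSCAN paper, where the density level is λ = 1/distance), a cluster’s stability sums, over each member point p, the gap between λ_p, the level at which p falls out of the cluster, and λ_birth, the level at which the cluster first appeared. The numbers below are invented for illustration:

```python
# Hypothetical cluster that appears when the distance threshold reaches 4.0
lambda_birth = 1 / 4.0

# Density levels (1 / distance) at which each of its four points drops out
lambdas_leaving = [1 / 1.0, 1 / 1.0, 1 / 2.0, 1 / 2.5]

# Stability: total "lifespan" of the cluster's members above its birth level
stability = sum(lp - lambda_birth for lp in lambdas_leaving)
```

Clusters whose members persist long after the cluster is born accumulate high stability and survive the pruning; short-lived clusters are condensed away.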

5. Extracting the Final Clusters

The last step in the HDBSCAN algorithm involves extracting the final clusters from the condensed tree. Every data point is then labeled according to the cluster it belongs to in this condensed tree. Notably, some points may not belong to any cluster; these are classified as “noise” and typically assigned a label of -1. This flexibility allows the algorithm to identify clusters of different shapes and densities, providing a more nuanced categorization than many other clustering algorithms.

For more information on the inner workings of HDBSCAN, check out the documentation here.

Implementing HDBSCAN

In this section, we’ll outline how to implement HDBSCAN using the hdbscan python package. First, we’ll need to install some libraries in order to run the code in this section. 

pip install hdbscan
pip install numpy
pip install matplotlib
pip install scikit-learn

Once the above libraries are installed, we can very easily run the HDBSCAN algorithm in just a few lines of code. The below code accomplishes a few steps:

  1. Import necessary libraries
  2. Create a synthetic dataset to cluster 
  3. Create a clustering object using the HDBSCAN algorithm 
  4. Fit HDBSCAN to the synthetic dataset
  5. Plot the results 

import hdbscan
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# Create synthetic dataset using moons with some noise
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)

# Create an HDBSCAN object
cluster_algorithm = hdbscan.HDBSCAN(min_cluster_size=10)

# Fit the model to the data
cluster_algorithm.fit(X)

# Get the cluster labels (Note: -1 means noise points)
labels = cluster_algorithm.labels_

# Get the number of clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# Plot the results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Spectral', s=40)
plt.colorbar()
plt.title(f'Number of clusters: {n_clusters}')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

From the above plot, we see that we get 2 clusters with some points classified as noise when using HDBSCAN as our clustering algorithm. This is exactly what we wanted to see with a dataset like this! This code example also illustrates how easy it is to implement the HDBSCAN algorithm yourself! 

Potential Pitfalls and Limitations of HDBSCAN

Despite its notable strengths and versatility, it’s crucial to understand that like any algorithm, HDBSCAN has its limitations. A proper understanding of these can help data scientists and machine learning practitioners effectively use HDBSCAN and avoid missteps in their analysis.

1. Computational Complexity

One of the significant challenges of HDBSCAN is its computational complexity. While HDBSCAN outperforms DBSCAN in computational performance (see the HDBSCAN docs here for reference), it still needs to construct a minimum spanning tree and build a cluster hierarchy. Both of these steps are computationally intensive and can become a bottleneck when dealing with very large datasets.

2. No Guarantee of Cluster Assignment

Unlike partition-based clustering algorithms such as k-means, HDBSCAN doesn’t guarantee that every point will be assigned to a cluster. Points that do not fit well into any cluster are classified as noise. While this can be a strength in datasets with actual noisy points, it could be a limitation if the aim is to cluster all data points.
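If a downstream task truly requires every point to have a label, one common workaround (a sketch, not part of the hdbscan API) is to reassign each noise point to the cluster of its nearest non-noise neighbor:

```python
import numpy as np

# Hypothetical labels from a density-based clusterer (-1 marks noise)
labels = np.array([0, 0, 1, 1, -1, 0, -1])
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [0, 2], [1, 0], [5, 4]], dtype=float)

# Map each noise point to the cluster of its nearest non-noise neighbor
noise = labels == -1
assigned = labels.copy()
for i in np.where(noise)[0]:
    d = np.linalg.norm(X[~noise] - X[i], axis=1)
    assigned[i] = labels[~noise][np.argmin(d)]
```

Whether this is appropriate depends on the application: forcing genuinely anomalous points into clusters can hide exactly the outliers HDBSCAN was chosen to surface.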

3. Difficulty with High-Dimensional Data

Although HDBSCAN generally handles high-dimensional data better than some other algorithms, it is not entirely immune to the “curse of dimensionality”. This term refers to the problem that as the dimensionality of a dataset increases, the volume of the data space increases so rapidly that the data available becomes sparse, making clustering more challenging. However, this limitation is true of any density-based clustering algorithm. 

While HDBSCAN is a robust and flexible clustering tool, being aware of these limitations can help guide its application. By understanding when and how to use it, data scientists and machine learning practitioners can make the most of HDBSCAN, while also recognizing when an alternative clustering approach might be more appropriate.

How Arize uses HDBSCAN

Arize incorporates the HDBSCAN algorithm into its embedding visualization and analysis toolset. Working alongside UMAP for dimensionality reduction, the platform efficiently identifies and visualizes clusters within your embeddings. But the process goes beyond simple identification; for each detected cluster, Arize calculates an array of metrics, including drift scores and performance indicators. This setup offers a practical way to troubleshoot problematic clusters and uncover potential clusters that may have otherwise gone unnoticed, which can subsequently be exported and labeled for further analysis. The integration of these techniques allows for a more nuanced understanding of the data at hand. See more information about Arize’s embeddings analysis tool here.

Conclusion

In this comprehensive exploration of the HDBSCAN algorithm, we’ve traveled from its foundational principles to its intricate processes, and even to its practical applications. Starting with an overview, we peeled back the layers to look closely at how HDBSCAN transforms the feature space through mutual reachability distances and constructs a Minimum Spanning Tree to capture the essence of the data. We understood how the hierarchy is built and later pruned to reveal the most stable and significant clusters, ultimately assigning each data point to its rightful cluster—or declaring it as noise.

We moved from theory to practice, discussing how to implement HDBSCAN with real-world data. But like any algorithm, HDBSCAN is not without its limitations. We explored its potential pitfalls, such as the computational complexity it inherits from the Minimum Spanning Tree construction, the lack of guaranteed cluster assignments for all data points, and its sometimes less-than-stellar performance on high-dimensional data.

Lastly, we saw how Arize uses HDBSCAN in a practical context, offering a real-world application that underscores the algorithm’s utility.

While not a one-size-fits-all solution, HDBSCAN has its place in the vast ecosystem of clustering algorithms. Understanding its inner workings not only broadens our academic knowledge but also equips us with another instrument for tackling the ever-complex challenges posed by real-world data.

References

https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
