HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.
- scikit-learn-contrib/hdbscan: A high performance implementation of HDBSCAN clustering.
- The key point: when the data contains clusters of different densities, it is useful to run DBSCAN with different epsilons.
HDBSCAN is a clustering algorithm developed by Campello, Moulavi, and Sander. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based on the stability of clusters.
- How HDBSCAN Works — hdbscan 0.8.1 documentation
- Convert DBSCAN to hierarchical clustering
Rereading this after reading a lot of other material, you can see that its author understood the algorithm very well and illustrated it very well.
- Understanding HDBSCAN and Density-Based Clustering
- Agglomerative Hierarchical Clustering (Single Linkage) in the space of mutual reachability distance
Increasing min_samples has the effect of smoothing the density estimation
- Small bumps disappear
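The smoothing effect of min_samples can be illustrated with a small sketch. It assumes the core distance is the distance to the k-th nearest neighbor, not counting the point itself (libraries may count differently), and the helper name `core_distances` and the toy points are made up for illustration. A tight pair of points looks dense when k=1, but that "small bump" disappears when k=3.

```python
import math

def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

# Toy data (hypothetical): a dense clump, a tight pair, and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),  # clump
    (3.0, 3.0), (3.05, 3.0),                                        # tight pair
    (6.0, 0.0),                                                     # isolated
]

core_k1 = core_distances(points, k=1)
core_k3 = core_distances(points, k=3)

# A small core distance means a high estimated density.
for p, c1, c3 in zip(points, core_k1, core_k3):
    print(p, "k=1:", round(c1, 3), "k=3:", round(c3, 3))
```

With k=1 the tight pair has a core distance as small as the clump's, so it looks like a density peak. With k=3 its core distance jumps to the distance to the far-away clump, while the real clump stays dense.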
Accelerated Hierarchical Density Clustering
Density-based Clustering https://findresearcher.sdu.dk/ws/files/171664269/widm.1343.pdf
Accelerated Hierarchical Density Clustering proposes a method for speeding up the HDBSCAN* algorithm, but of particular interest is that it describes the algorithm from three different perspectives:
- statistical perspective:
- Assume the data are sampled from an unknown density function
- Construct the cluster tree from the level sets of the density function
- Extend the idea of Robust Single Linkage
- Single linkage clustering in Euclidean space is more susceptible to noise because noisy points may form false bridges across islands. By embedding the points in λ-space, the clustering becomes much more robust to noise due to the “repulsion effect.” --- Understanding HDBSCAN and Density-Based Clustering
- computational perspective:
- Interpreted as a hierarchical extension of DBSCAN that searches over all values of the parameter epsilon
- Introduces a distance metric called mutual reachability distance
- Cluster extraction based on tree simplification and stability scores
- topological perspective
- Using the concept of persistent homology
- Representation of cluster structure using sheaf theory
- Formulate persistence over distance scales
These three perspectives show that the same algorithm can be interpreted in different theoretical frameworks, providing an easily understandable explanation for researchers in each field.
In particular, the topological perspective is a new contribution of this paper, suggesting that it allows for multiparametric extensions and the application of more general mathematical tools.
This theoretical integration is important as it deepens our understanding of HDBSCAN* and paves the way for further improvements. It also provides a basis for researchers from different disciplines to collaborate and improve the algorithm.
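The computational perspective, "DBSCAN for every epsilon", can be made concrete with a small sketch. It is not the accelerated algorithm from the paper; it is a brute-force illustration under stated assumptions: the core distance is the distance to the k-th nearest neighbor (self excluded), the mutual reachability distance is max(core(a), core(b), d(a, b)), and the helper `clusters_at` and the toy points are hypothetical names and data.

```python
import math

def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    return [
        sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[k - 1]
        for i, p in enumerate(points)
    ]

def clusters_at(points, k, eps):
    """DBSCAN*-style labels at one epsilon.

    Points are linked when their mutual reachability distance is <= eps;
    clusters are the connected components, and points whose core distance
    exceeds eps are labeled -1 (noise).
    """
    core = core_distances(points, k)
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mreach = max(core[i], core[j], math.dist(points[i], points[j]))
            if mreach <= eps:
                parent[find(i)] = find(j)

    labels, ids = [], {}
    for i in range(n):
        if core[i] > eps:
            labels.append(-1)
        else:
            labels.append(ids.setdefault(find(i), len(ids)))
    return labels

# Toy data (hypothetical): a dense square and a sparse square.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]

labels_small = clusters_at(points, k=2, eps=0.2)   # only the dense square
labels_mid = clusters_at(points, k=2, eps=1.5)     # both squares
labels_large = clusters_at(points, k=2, eps=20.0)  # everything merged
print(labels_small, labels_mid, labels_large, sep="\n")
```

No single epsilon is right for both squares at every scale: a small epsilon treats the sparse square as noise, and a very large one merges everything. Sweeping epsilon and keeping the clusters that persist is the idea the hierarchy captures.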
Understanding HDBSCAN and Density-Based Clustering
This blog post is an explanation intended to help you intuitively understand the HDBSCAN* algorithm. The main points are:
- basic features of HDBSCAN*:
- Robust against noisy data
- Fewer assumptions about cluster shape and density
- Unlike K-means, no need to specify the number of clusters in advance
- the concept of density-based clustering:
- Define a cluster as “a dense region separated by a sparse region”
- Capture the peaks and valleys of the probability density function (PDF) as clusters
- Cluster persistence
- method of density estimation:
- Use the distance to the kth nearest neighbor (core distance)
- Introduces a distance metric called mutual reachability distance
- This has the effect of “repelling” points in sparse regions
- Parameter setting:
- min_samples: controls PDF estimation smoothness
- min_cluster_size: Specifies the minimum cluster size, ignoring small variations
- implementation flow
- core distance calculation
- construction of a minimum spanning tree based on mutual reachability distance
- pruning of the tree
- cluster selection by excess of mass
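The first two steps of the flow above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes the core distance is the distance to the k-th nearest neighbor (self excluded), uses Prim's algorithm on the complete graph, and leaves out the pruning and excess-of-mass steps. The function names and toy points are hypothetical.

```python
import math

def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    return [
        sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[k - 1]
        for i, p in enumerate(points)
    ]

def mutual_reachability_mst(points, k):
    """Minimum spanning tree under mutual reachability distance (Prim).

    Returns edges as (weight, i, j), sorted by increasing weight.
    """
    core = core_distances(points, k)
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight into the tree
    best_from = [-1] * n    # tree endpoint of that cheapest edge
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best[u], best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                # mutual reachability: max(core(u), core(v), d(u, v))
                d = max(core[u], core[v], math.dist(points[u], points[v]))
                if d < best[v]:
                    best[v], best_from[v] = d, u
    return sorted(edges)

# Toy data (hypothetical): a dense square and a sparse square.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]

mst = mutual_reachability_mst(points, k=2)
for weight, i, j in mst:
    print(f"{i} - {j}: {weight:.3f}")
```

Reading the sorted edges from lightest to heaviest gives the single-linkage merge order. Reading them from heaviest to lightest gives the order in which clusters split as the distance threshold shrinks; here the single heavy edge is the bridge between the two squares.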
This article is not a theoretical explanation, but a practical one that emphasizes intuitive understanding. In particular, it is unique in its extensive use of visual examples to illustrate the operation of the algorithm.
HDBSCAN explained → install and run in Python - Qiita
from BERTopic: Neural topic modeling with a class-based TF-IDF
HDBSCAN
BERTopic uses the HDBSCAN algorithm to cluster documents in the embedding space. This is one of the key steps in BERTopic’s topic modeling process.
HDBSCAN, which stands for Hierarchical Density-Based Spatial Clustering of Applications with Noise, is a density-based clustering algorithm. It has the following characteristics:
- Can take into account differences in density: HDBSCAN allows clustering based on density differences. This automatically separates high-density and low-density areas to find more compact and meaningful clusters.
- Robust against noise: Points in low-density areas can be treated as noise, making the result less susceptible to outliers. This is an important property in topic modeling to properly handle irrelevant documents.
- No need to specify the number of clusters: Unlike many clustering methods, HDBSCAN does not require a pre-specified number of clusters. This is especially useful when the optimal number of topics is unknown.
- Hierarchical clustering is possible: HDBSCAN can capture the hierarchical structure of clusters. This may allow topic granularity to be adjusted and more detailed subtopics to be found.
BERTopic uses this HDBSCAN algorithm to cluster documents in the embedding space, mapping each cluster to a single topic. This allows semantically related documents to be grouped into the same topic, resulting in a more interpretable topic model. By taking advantage of the properties of HDBSCAN, BERTopic provides flexible topic extraction that is robust to noise and accounts for density differences.
This page is auto-translated from /nishio/HDBSCAN using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.