2024-11-08 While experimenting with running DBSCAN on the output of UMAP, I came to feel that the usual clustering assumption that every data point must belong to exactly one cluster is undesirable.
- At the UMAP stage, the high-dimensional data has already been reduced to two dimensions and visualized.
- Even when clustering is used to explain that picture, there is no need to "make every data point belong to one of the k clusters."
- Adding unnecessary constraints only makes the problem harder.
- I think forcing an unreasonable clustering also lowers the quality of the downstream cluster commentary generated by AI.
- Instead, it may be better to extract a few "high-density clumps" and have AI explain only those; that would be easier to understand and more satisfying for the user. (A minimal sketch of this idea follows.)
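A minimal sketch of that idea, assuming the input is the 2-D UMAP embedding (the function name, parameter values, and size threshold below are placeholders, not the pipeline actually used here):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dense_clumps(embedding_2d, eps=0.3, min_samples_ratio=0.005, min_clump_size=50):
    """Keep only sufficiently large dense clusters; everything else stays unassigned (-1)."""
    n = len(embedding_2d)
    min_samples = max(2, int(n * min_samples_ratio))  # ratio-based, see the notes below
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embedding_2d)
    keep = {c for c in set(labels) - {-1} if np.sum(labels == c) >= min_clump_size}
    return np.where(np.isin(labels, list(keep)), labels, -1)
```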
Thought notes on the work process
Density-based spatial clustering of applications with noise (DBSCAN)
The original data looks like this; how should it be clustered?
DBSCAN has a parameter eps, which sets how close two points must be to count as neighbors, and a parameter min_samples, which sets how many neighbors a point needs within eps to be part of a dense region (points that belong to no dense region are discarded as outliers). So I observed what happens under various combinations of these parameters (a sketch of the sweep follows this list).
- There are 3970 original data points, and picking min_samples as 5 or 10 felt too ad hoc, so I set it as a ratio of the number of data points.
- In this use case the data points are independent, so with half as many people the data would effectively be thinned at random by half; a ratio is reasonable because every cluster's membership would also be halved in that case.
- Points judged to be outliers are colored light gray.
- It's important to visualize the results: if someone reported "We've got five clusters!" and then this came out, I'd be tempted to say "Redo it!"
- Here eps is too small, so most points fall apart and are judged to be outliers.
- It helps to look at the "percentage of data discarded as outliers."
- In terms of not throwing away too much data, this is the setting I prefer out of these 15.
- On the other hand, I would like the larger, more populous clusters to be separated a bit more.
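The sweep referred to above could look roughly like this (a sketch only; the eps values and ratios in the grid are illustrative, not the 15 settings actually tried):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

def sweep_dbscan(embedding_2d, eps_values=(0.1, 0.2, 0.4), ratios=(0.0025, 0.005, 0.01)):
    n = len(embedding_2d)
    fig, axes = plt.subplots(len(ratios), len(eps_values), figsize=(12, 10))
    for i, ratio in enumerate(ratios):
        for j, eps in enumerate(eps_values):
            labels = DBSCAN(eps=eps, min_samples=max(2, int(n * ratio))).fit_predict(embedding_2d)
            outlier = labels == -1
            ax = axes[i][j]
            # Outliers in light gray, clustered points colored by label
            ax.scatter(*embedding_2d[outlier].T, s=2, c="lightgray")
            ax.scatter(*embedding_2d[~outlier].T, s=2, c=labels[~outlier], cmap="tab10")
            ax.set_title(f"eps={eps}, ratio={ratio}, outliers={outlier.mean():.0%}")
    plt.tight_layout()
    plt.show()
```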
Largest cluster label: 0, Size: 3313
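A line like that could be produced with something like this (assuming `labels` is the DBSCAN label array; a sketch, not the original code):

```python
import numpy as np

sizes = np.bincount(labels[labels != -1])  # cluster sizes, ignoring outliers (-1)
largest = int(np.argmax(sizes))
print(f"Largest cluster label: {largest}, Size: {sizes[largest]}")
```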
central cluster
- Hmm, subjectively I prefer the diagonal split between the two, but those points are still treated as outliers.
- If you increase eps to reduce the outliers, the clusters get merged.
- Hmm, maybe the real problem is interpreting DBSCAN "outliers" as "data discarded because it is a far-away outlier"; perhaps they should instead be read as "data that sits between two clusters and is hard to assign to either side."
- "We want to separate A from B, so where do we draw the line?"
- Isn't that question itself wrong in the first place?
- Drawing a boundary line builds in the assumption that the data can be divided by a line with no thickness.
- In reality, many things in the world do not have clear boundaries; they are separated across a wide "neither-here-nor-there" zone.
- At first I was surprised that this part got split.
- You can see why when you look at the runs with smaller eps and higher ratios.
- (At first I brushed this off as "too many outliers," but it is actually important.)
- I, as a human, had not noticed this just by eyeballing the two-dimensional plot, but there is a zone of high density at the tip of that strand.
- KDE (kernel density estimation) makes the density easier to see; it looks like this.
- From that high-density zone the territory spreads outward within eps, and whatever it connects to is treated as one contiguous region.
Seeing the KDE, I thought I would rather run the KDE with a wider bandwidth.
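A sketch of that KDE view; the bandwidth value is illustrative, and a larger bandwidth gives the smoother ("wider band") estimate mentioned here:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

def plot_kde(embedding_2d, bandwidth=0.5, grid_size=200):
    kde = KernelDensity(bandwidth=bandwidth).fit(embedding_2d)
    xs = np.linspace(embedding_2d[:, 0].min(), embedding_2d[:, 0].max(), grid_size)
    ys = np.linspace(embedding_2d[:, 1].min(), embedding_2d[:, 1].max(), grid_size)
    xx, yy = np.meshgrid(xs, ys)
    log_density = kde.score_samples(np.column_stack([xx.ravel(), yy.ravel()]))
    plt.contourf(xx, yy, np.exp(log_density).reshape(xx.shape), levels=20, cmap="viridis")
    plt.scatter(*embedding_2d.T, s=1, c="white", alpha=0.3)
    plt.show()
```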
- Let's start with the broad strokes.
- First, establish the overall picture: "something like this exists."
- Then, within it, "the distribution of opinions in the high-density zones looks like this."
- And then AI provides explanations for these "dense clumps" (a sketch of this flow follows this list).
- I think that is a good way to explain it.
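A hypothetical sketch of that flow: collect the opinions that fall in each dense clump and build a prompt asking an LLM to explain them (the texts, labels, prompt wording, and the 50-item cap are all placeholders):

```python
def build_clump_prompts(texts, labels):
    prompts = {}
    for c in sorted(set(labels) - {-1}):  # skip the unassigned points (-1)
        members = [t for t, l in zip(texts, labels) if l == c]
        joined = "\n".join(f"- {t}" for t in members[:50])  # cap to keep the prompt short
        prompts[c] = (
            "The following opinions form one dense clump in the opinion map.\n"
            "Explain in a few sentences what they have in common:\n" + joined
        )
    return prompts
```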
I have summarized this realization at the top of the page.
Other Methods
- For the central cluster, we are currently clustering based on the "first UMAP over all the data," but another option is to go back to that cluster's original data and run UMAP on it again (a sketch follows this list).
- But I think that would be too difficult to explain to the general public.
- Since each axis of the original data here is agree/disagree data, there is also the option of using a For and Against Density Indication without clustering at all.
- That cannot be used for UMAP embeddings of natural language, though.
- In that case, should the approach be to let the LLM itself discover the clusters?
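A sketch of the "go back to the original data and run UMAP again" option for the central cluster, using the umap-learn package (variable names are placeholders):

```python
import numpy as np
import umap  # umap-learn

def reembed_largest_cluster(original_data, labels):
    sizes = np.bincount(labels[labels != -1])      # ignore outliers (-1)
    largest = int(np.argmax(sizes))
    subset = original_data[labels == largest]      # back to the high-dimensional rows
    return umap.UMAP(n_components=2).fit_transform(subset), largest
```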
I got feedback on this article.
This page is auto-translated from /nishio/UMAPăźç”æăăŻă©ăčăżăȘăłă°ăăăčăă using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.