-
I was pointed to an interesting discussion relevant to [Should UMAP results be clustered?]:
- interpretation - Clustering on the output of t-SNE - Cross Validated
-
When applied to a high-dimensional normal distribution, which is inherently hard to reduce to two dimensions, tSNE (and perhaps UMAP) can overreact to random fluctuations in local density and create artifacts: clusters that do not actually exist.
- This problem is easy to see when experimenting with mathematically generated data.
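The kind of experiment described above can be sketched as follows. This is only a minimal illustration, assuming scikit-learn is available; the sample size, dimensionality, perplexity, and `n_clusters=4` are arbitrary choices of mine, not values from the original discussion. It embeds a single Gaussian blob with t-SNE, runs k-means on the embedding, and then scores the same labels in both spaces.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# One isotropic Gaussian in 50 dimensions: by construction there is only ONE cluster.
X = rng.normal(size=(300, 50))

# Embed into 2-D with t-SNE (perplexity is an arbitrary choice for this sketch).
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)

# k-means on the embedding always returns k groups, whether or not they exist.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(emb)

# Score the SAME labels in the embedding and in the original space.
sil_2d = silhouette_score(emb, labels)  # separation as it appears in the 2-D picture
sil_hd = silhouette_score(X, labels)    # separation in the original 50-D data

print(f"silhouette in 2-D embedding: {sil_2d:.3f}")
print(f"silhouette in original 50-D: {sil_hd:.3f}")
```

If the groups found in the embedding score noticeably better there than in the original space, that is a hint they are products of the embedding rather than structure in the data.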
-
Real-world data often has "good features," such as a few scattered "high-density areas" surrounded by "low-density areas." Because tSNE and UMAP build their visualization from distances to neighboring data points, they can reduce dimensionality properly while preserving these features.
-
On the other hand, when they are applied to a single blob of high-dimensional data, some distortion is unavoidable, because the data is essentially high-dimensional and cannot be expressed in two dimensions. The distortion can show up as a single cluster torn apart into several clusters, or as density fluctuations with no real meaning being exaggerated, and this hurts any clustering done at a later stage.
-
Real data is probably made up of several small, nicely separated clusters plus one or two large, high-dimensional, hard-to-separate "mud" clusters. "tSNE and kMeans" is the worst way to analyze these mud clusters because the density information is distorted, while DBSCAN-family methods are better thanks to a "focus on the neighborhood" property similar to that of tSNE and UMAP. (I wonder if the SpectralClustering used by TTTC is the same?) However, with the heuristic of "using a high-density area as a seed and searching the neighborhood from there," it seems possible that the seed ends up in the wrong location.
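To make the contrast between k-means and a DBSCAN-style method concrete, here is a small sketch, assuming scikit-learn. The 2-D Gaussian blob is a stand-in of mine for an embedded mud cluster, and `eps=0.5`, `min_samples=10`, and `n_clusters=4` are arbitrary illustrative values. k-means must partition the blob, while DBSCAN follows density connectivity.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)

# A single 2-D blob, standing in for one embedded "mud" cluster.
blob = rng.normal(size=(500, 2))

# k-means is forced to cut the blob into n_clusters pieces.
km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(blob)

# DBSCAN grows clusters from dense neighborhoods; label -1 marks noise points.
db_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(blob)

# Fraction of all points that falls into the largest cluster of each method.
km_largest = np.bincount(km_labels).max() / len(blob)
db_largest = np.bincount(db_labels[db_labels >= 0]).max() / len(blob)

print(f"k-means largest cluster: {km_largest:.0%} of points")
print(f"DBSCAN  largest cluster: {db_largest:.0%} of points")
```

The DBSCAN result still depends heavily on `eps` and `min_samples`, which is one way the "wrong seed" concern above can show up in practice.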
-
In light of this, the method I tried in "Should UMAP results be clustered?" of separating the peripheral clusters from the central mud balls is still fine, but applying density estimation to the mud balls is clearly a bad idea. It would be better to give up on visualizing the mud balls and cluster them in the higher-dimensional space of the original data.
- How to tackle the high-dimensional mud pie is a challenge for the future…
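One possible shape for "keep the peripheral clusters, re-cluster the mud in the original space" is sketched below. Everything here is hypothetical: the helper name `recluster_mud`, the assumption that the mud ball is simply the largest coarse cluster, and the use of KMeans as a placeholder for whatever algorithm turns out to suit high-dimensional data. It is not the method from the earlier page.

```python
import numpy as np
from sklearn.cluster import KMeans


def recluster_mud(X, coarse_labels, n_sub, random_state=0):
    """Keep peripheral clusters; re-split the largest ("mud") cluster in the original space."""
    coarse_labels = np.asarray(coarse_labels)
    ids, counts = np.unique(coarse_labels[coarse_labels >= 0], return_counts=True)
    mud_id = ids[np.argmax(counts)]  # assumption: the mud ball is the largest coarse cluster
    mud_mask = coarse_labels == mud_id

    # Cluster the mud points on the ORIGINAL features, not on the 2-D embedding.
    # KMeans is only a placeholder algorithm here.
    sub = KMeans(n_clusters=n_sub, n_init=10, random_state=random_state).fit_predict(X[mud_mask])

    refined = coarse_labels.copy()
    refined[mud_mask] = coarse_labels.max() + 1 + sub  # fresh ids for the sub-clusters
    return refined, mud_mask


# Synthetic usage: two tight peripheral clusters plus one broad 20-D mud ball.
rng = np.random.default_rng(0)
periph_a = rng.normal(loc=8.0, scale=0.3, size=(30, 20))
periph_b = rng.normal(loc=-8.0, scale=0.3, size=(30, 20))
mud = rng.normal(scale=2.0, size=(200, 20))
X = np.vstack([periph_a, periph_b, mud])

# Stand-in for coarse labels that would come from clustering the embedding.
coarse = np.array([0] * 30 + [1] * 30 + [2] * 200)

refined, mud_mask = recluster_mud(X, coarse, n_sub=3)
print(np.unique(refined, return_counts=True))
```

The peripheral labels pass through unchanged, so the 2-D picture can still be used for them, while the mud ball is only ever analyzed in its original dimensionality.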
Especially when the original data is low-dimensional, tSNE and UMAP create artifacts
This page is auto-translated from /nishio/tSNEăźç”æăźăŻă©ăčăżăȘăłă°ăŻæ éă« using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. Iâm very happy to spread my thought to non-Japanese readers.