- Q&A in the context of [UMAP visualization for the 2022 Upper House election]
- Q: What is UMAP and how is it different from PCA?
- A:
- PCA is like rotating a high-dimensional object and viewing it from the direction in which it looks widest.
- In the case of a public opinion map, the observation target has around 10 dimensions at most, whereas the data from the joint research by the Taniguchi Laboratory (University of Tokyo) and Asahi Shimbun has 42 dimensions, so it tends to be a case of “no matter which angle you look from, the data is thick in the depth direction…”.
- UMAP (like other nonlinear dimensionality reduction methods) is like pressing a rubber membrane against the object under observation so that it captures the shape as closely as possible.
- Because the membrane bends, the vertical and horizontal axes carry no meaning, but depending on the object under observation, the structure can be captured more clearly than with PCA.
- UMAP is a relatively new nonlinear method published in 2018, and Talk to the City also uses it.
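To put rough numbers on the “thick in the depth direction” point above, here is a minimal sketch. It assumes, purely for illustration, that the variance is spread evenly over all dimensions, which is the worst case for a 2-D view; real survey data will not be this flat. `visible_fraction` is a hypothetical helper, not part of any library.

```python
def visible_fraction(axis_variances, n_view_dims=2):
    """Share of total variance that survives when the data is viewed
    along its n_view_dims widest principal axes (what a PCA plot shows)."""
    ordered = sorted(axis_variances, reverse=True)
    return sum(ordered[:n_view_dims]) / sum(ordered)

# Assumption: equal variance on every axis (illustrative worst case).
flat_10 = [1.0] * 10   # about 10 dimensions, as in the public opinion map
flat_42 = [1.0] * 42   # 42 dimensions, as in the Taniguchi Lab / Asahi data

print(f"{visible_fraction(flat_10):.3f}")  # 0.200
print(f"{visible_fraction(flat_42):.3f}")  # 0.048
```

Under this assumption a 2-D PCA view keeps about 20% of the variance for 10 dimensions but under 5% for 42 dimensions; everything else lies in the “depth” directions that the view cannot show.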
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
Leland McInnes, John Healy, James Melville
UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.
https://arxiv.org/abs/1802.03426
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction — umap 0.5 documentation
https://qiita.com/odanny/items/06ab88353bcee7bf6aa7
from BERTopic: Neural topic modeling with a class-based TF-IDF
UMAP
UMAP stands for Uniform Manifold Approximation and Projection and is a dimensionality reduction method for embedding high-dimensional data into a low-dimensional space. BERTopic uses UMAP to convert high-dimensional document embeddings into a low-dimensional space suitable for clustering.
UMAP has the following characteristics:
- Nonlinear dimension reduction: UMAP performs nonlinear dimensionality reduction that aims to preserve the local structure of the data while retaining much of its global structure. This allows the intrinsic structure of the data to be represented more faithfully in a lower-dimensional space.
- High flexibility: UMAP has several hyperparameters, such as how the neighborhood graph is constructed and the cost function. By adjusting these, you can tune the embedding to the characteristics of your data.
- Scalability: UMAP works efficiently on large data sets. This is an important property when dealing with large numbers of documents, as in topic modeling.
- Probabilistic interpretation: The UMAP embedding has a probabilistic interpretation, so distances in the embedding space can be read as reflecting the similarity between data points.
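One way to see where the probabilistic interpretation comes from: when UMAP builds its neighborhood graph, each edge gets a membership strength between 0 and 1. The following is a simplified sketch of that step, following the normalization described in the UMAP paper (weights of the form exp(-(d - rho) / sigma), with sigma chosen so that the weights sum to log2(k)). The function names `membership_weights` and `fuzzy_union` and the sample distances are hypothetical, and this is not the code of the umap-learn library.

```python
import math

def membership_weights(knn_dists, tol=1e-5, max_iter=64):
    """Directed membership strengths from one point to its k nearest neighbors.

    knn_dists: distances to the k nearest neighbors (k >= 3), sorted ascending.
    """
    k = len(knn_dists)
    rho = knn_dists[0]          # distance to the closest neighbor
    target = math.log2(k)       # the weights are normalized to sum to log2(k)

    def weights(sigma):
        return [math.exp(-max(d - rho, 0.0) / sigma) for d in knn_dists]

    # Binary search for the sigma whose weights sum to the target.
    lo, hi, sigma = 0.0, math.inf, 1.0
    for _ in range(max_iter):
        total = sum(weights(sigma))
        if abs(total - target) < tol:
            break
        if total > target:      # weights too large: shrink sigma
            hi = sigma
            sigma = (lo + hi) / 2
        else:                   # weights too small: grow sigma
            lo = sigma
            sigma = sigma * 2 if math.isinf(hi) else (lo + hi) / 2
    return weights(sigma)

def fuzzy_union(w_ij, w_ji):
    """Combine the two directed strengths of an edge into one symmetric value."""
    return w_ij + w_ji - w_ij * w_ji

# Hypothetical distances from one point to its 5 nearest neighbors.
w = membership_weights([0.5, 0.8, 1.0, 1.7, 3.0])
print([round(x, 3) for x in w])
print(fuzzy_union(0.5, 0.5))    # 0.75
```

The nearest neighbor always gets strength 1, and the strengths fall off with distance. The low-dimensional layout is then optimized so that points joined by strong edges end up close together, which is why closeness in the embedding can be read as similarity.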
In BERTopic, documents are embedded into high-dimensional vectors by a language model such as BERT and then projected into a low-dimensional space using UMAP. Applying HDBSCAN in this low-dimensional space allows for efficient and effective clustering.
The non-linearity and scalability of UMAP allow BERTopic to extract meaningful topics even for large, noisy data sets. It is also possible to control the granularity of topics to some extent by adjusting the parameters of UMAP.
As described above, UMAP is an important preprocessing step in BERTopic, serving to convert high-dimensional document embeddings into a form suitable for clustering.
This page is auto-translated from /nishio/UMAP using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to share my thoughts with non-Japanese readers.