2024-11-22 Cybozu Labs Study Session
- Previously: 2024-10-18 Public Opinion Map Study Group.
- It was the day that "Public Opinion Map", a service that collects and visualizes people's opinion vectors for the 2024 lower house election, was released.
- This ended up with 4403 rows x 84 columns of data
- (We are in the process of converting it to open data and will write about it here when it is ready)
- In this article, I will explain "analysis of high-dimensional data", including the analysis of this dataset.
- Rough Table of Contents
- History of dimensionality reduction (from PCA to UMAP)
- UMAP visualization experiment for the 2022 Upper House election
- pleasant feeling
- UMAP of public opinion map Experiment
- Struggle with high-dimensional mud dumplings
- Distressed → Aware
- New projects (to be announced 11/22)
- Extraction of dense masses
- Relation to the KJ method
History of dimensionality reduction (from PCA to UMAP)
- PCA (Principal Component Analysis) is a linear dimensionality reduction method
- The classic method proposed by Karl Pearson in 1901
- Mid-20th century: limitations of PCA (its reliance on linearity) begin to be pointed out, and the need for methods that can handle nonlinear structures increases
- 1960s: Multidimensional Scaling (MDS)
- Mapping high-dimensional data to low-dimensional space based on a distance matrix.
- Classical MDS is also called Principal Coordinate Analysis (PCoA), and is equivalent to Principal Component Analysis when Euclidean distance is used. --- Multidimensional scaling - Wikipedia
- There are linear and nonlinear versions, and nonlinear MDS can handle non-Euclidean distances.
- 1982: Self-Organizing Map (SOM)
- Actually, my PhD research 20 years ago was an extension of SOM, so when I asked ChatGPT to write a history of nonlinear dimensionality reduction and this came up, I thought, "Whoa, a familiar face!"
- ChatGPT describes it as "based on a neuroscientific approach, mapping data onto a low-dimensional grid". Broadly speaking that is not wrong, but like many "models inspired by the neural connections in the brain" of that era, it no longer has much to do with how the brain actually behaves.
- Well, now neural nets are much less neural.
- Roughly speaking, it is like the k-means method with adjacency/distance relations defined between the representative points
- Pioneering role as an analysis method that takes into account the topological characteristics of data
- As of 20 years ago, extensions to non-Euclidean distances were being made.
- 1998: Kernel PCA
- Uses the kernel method to deal with nonlinear structures
- Maps data into a reproducing kernel Hilbert space (RKHS) and applies PCA there
- One of the offshoots of the boom in [Support Vector Machine]s
- Kernel concepts are getting renewed attention in relation to the mathematics of deep learning
- Development of "analytical methods that take into account the topological properties of the data"
- 2000: Isomap (Isometric Mapping)
- Approximates the distance structure with a graph and performs dimensionality reduction that preserves the manifold structure
- 2000: LLE (Locally Linear Embedding)
- Dimensionality reduction is achieved by exploiting the local linearity of the data.
- 2008: t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Probabilistic representation of local similarity in higher dimensional spaces and mapping to lower dimensions.
- Widely used as a specialized method for visualization.
- 2018: UMAP (Uniform Manifold Approximation and Projection)
- Improvements to the shortcomings of t-SNE (computational load and collapse of local structure).
- Currently the mainstream method for visualization and dimensionality reduction
- I was going to explain the detailed UMAP algorithm, but that alone would take an hour, so I decided to save it for another time.
- Summary of this section
- PCA is a classical dimensionality reduction method from 1901
- Because linearity is a limitation, nonlinear dimensionality reduction methods have been studied for many years
- Right now, using UMAP (2018) is considered a good choice
- Cases where PCA is not a good choice
- antagonistic dimension
- 1: A and B are in conflict
- 2: (A+B) and C are in conflict
- 3: ((A+B)+C) and D are in conflict
- My experience so far has been that Polis tends to become less and less divided into clusters as it progresses.
- I did not want this Public Opinion Map to end up with the clusters failing to separate at the end.
- So I was considering a version that uses non-linear UMAP instead of dimensionality reduction using the original Polis PCA.
- In the Public Opinion Map experiment, the "tendency for clusters to become less divided as things progress" was not observed
- This phenomenon may be due not to an increase in the number of voting users, but rather to the opinion space becoming higher-dimensional as users contribute new opinions
- Detailed discussion: Problem with two Polis clusters.
- As a preliminary experiment, before the Public Opinion Map data were available, I used UMAP to visualize the 2022 Upper House election data (joint research by the Taniguchi Laboratory, University of Tokyo, and the Asahi Shimbun) that had been analyzed with PCA (592 respondents, 42 questions)
- Left: UMAP visualization (UMAP visibility for the 2022 Upper House election) / Right: Polis-like PCA visualization of the 2022 Upper House election
- Looking only at the dots, the PCA result is almost just an ellipse
- Since it is hard to interpret as it is, each party is enclosed in a convex hull and colored
- This is what "PCA no longer separates the clusters" means
- Even though it does not separate into distinct clusters, it is still quite interesting, and Mr. Yasuno introduced it in the keynote of Code for Japan Summit 2024 on 2024-11-16.
- The advantage is that, being PCA, the axes have interpretable meaning
- Nonlinear methods make that kind of axis interpretation impossible because the axes bend all over the place
- Compared with the PCA result, which is almost a single elliptical blob, the UMAP result is very clearly divided into clusters.
- Red boxes are manually added for detailed explanation
- For a detailed cluster-by-cluster explanation, see: UMAP visibility for the 2022 Upper House election.
- Some islands are dense, occupied by a particular political party, while others are a mixture of several parties
- In other words, when there are parties A, B, and C, we tend to think that the division of opinion is between the parties, but in fact the division is between extremists and moderates within the parties.
- I think this is a pretty interesting outcome.
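For reference, a minimal sketch of this kind of PCA-versus-UMAP side-by-side comparison. The random matrix is only a placeholder for the 592 x 42 respondent-by-question data, and the UMAP parameters are assumptions, not the settings actually used.

```python
# Sketch of the side-by-side comparison (the random matrix is only a placeholder
# for the 592 x 42 respondent-by-question data; parameters are assumptions).
import numpy as np
import matplotlib.pyplot as plt
import umap
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(592, 42)).astype(float)   # placeholder for the survey answers

pca_xy = PCA(n_components=2).fit_transform(X)
umap_xy = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)

fig, (ax_pca, ax_umap) = plt.subplots(1, 2, figsize=(10, 4))
ax_pca.scatter(pca_xy[:, 0], pca_xy[:, 1], s=5)
ax_pca.set_title("PCA (Polis-style, interpretable axes)")
ax_umap.scatter(umap_xy[:, 0], umap_xy[:, 1], s=5)
ax_umap.set_title("UMAP (nonlinear)")
plt.show()
```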
- I got the Public Opinion Map data and immediately ran UMAP on it.
- Public Opinion Map 3970 UMAP
- 3970 people x 84 dimensions (when data from all maps are combined)
- Unlike the 2022 Upper House election data, there is no party affiliation, so I used DBSCAN to cluster the data
- DBSCAN has nothing to do with database scanning; it stands for Density-Based Spatial Clustering of Applications with Noise
- A density-based clustering method (I'll say more about this later, so that's enough for now)
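A minimal sketch of this pipeline, assuming the opinion matrix is a pandas DataFrame; the file name, the mean-filling of missing values, and the eps/min_samples values are assumptions, not the actual settings used.

```python
# Minimal sketch (assumed file name, fill strategy, and parameter values):
# reduce the opinion matrix with UMAP, then cluster the 2D embedding with DBSCAN.
import pandas as pd
import umap
from sklearn.cluster import DBSCAN

# rows = users, columns = opinions, values in {-1, 0, +1}, NaN = not voted
matrix = pd.read_csv("opinion_matrix.csv", index_col=0)      # hypothetical file
X = matrix.fillna(matrix.mean()).to_numpy()                  # fill missing values with column means

# 2D embedding for visualization
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X)

# Density-Based Spatial Clustering of Applications with Noise on the 2D embedding
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embedding)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters, {(labels == -1).sum()} points labeled as noise (-1)")
```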
- There are about 13 scattered clusters and a central [high-dimensional mud dumpling]
- What is the data volume ratio?
- Of the 3970 in total, 655 are in the smaller clusters around the perimeter
- 84% of users are in a central cluster with no clear separation
- Perhaps when politicians answer a questionnaire under their own names, there are "dynamics of opinion polarization" at work, such as alignment with the claims of their own party, position-taking by opposition parties against the ruling party, group polarization, and so on
- The opinion vectors of anonymous citizens are not so clearly separated.
Struggle with high-dimensional mud dumplings
- Can we analyze this central part a little better?
- It looks like it has a meaningful internal structure, doesn't it?
- While trying to analyze this part, I started to think that it might be better not to cluster UMAP results at all (Should UMAP results be clustered?)
- Increasing the eps parameter to reduce the gray "outliers" causes clusters to be merged
- This is more a problem of interpretation: DBSCAN "outliers" should not be read as "data discarded because it lies far away", but as "data that sits between two clusters and is hard to assign to either side"
- "We want to separate A from B, so where do we draw the line?"
- Isn't this question itself wrong in the first place?
- Drawing a boundary line builds in the assumption that the data can be divided by a line of zero thickness
- In reality, many things in the world do not have clear boundaries, but are separated across a wide "neither-here-nor-there" zone
- (I understood later that this was reinventing the design concept of the DBSCAN/HDBSCAN system.)
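A small synthetic illustration of this eps trade-off (the data and parameter values are made up for the illustration): as eps grows, fewer points are labeled as noise, but neighboring dense regions merge.

```python
# Synthetic illustration: as eps grows, DBSCAN labels fewer points as noise (-1),
# but the two dense regions eventually merge into a single cluster.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# two blobs with a sparse "neither-here-nor-there" zone between them
X, _ = make_blobs(n_samples=600, centers=[[0, 0], [3, 0]], cluster_std=0.6, random_state=0)

for eps in (0.2, 0.4, 0.8, 1.6):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps:.1f}: clusters={n_clusters}, noise={(labels == -1).sum()}")
```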
- I shared this story within the Yasuno team, and they pointed me to an interesting article: Careful clustering of tSNE results
- Next, clustering was attempted directly in the high-dimensional space rather than on the post-UMAP data
- I was shocked to see these results.
- What I mean is:
- the "cluster-like dense masses" (blue and purple) seen in the 2D UMAP visualization
- are actually not dense at all in the high-dimensional space
- What is happening?
- The more "neither for nor against" data a point has, the redder it is colored
- The "cluster-like dense masses" were collections of data with few missing values, while the surrounding complex, meaningful-looking shapes were data with many missing values
- Originally, this data was filtered with matrix.dropna(thresh=3) (keep only rows with at least 3 non-missing answers), but when I changed this to matrix.dropna(thresh=8), the UMAP result was as follows
- Almost all of the complex shapes were artifacts caused by missing values
- Reprinted: The opinion vectors of anonymous citizens are not so clearly separated.
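For reference, a toy sketch of the thresh filter discussed above; the toy matrix is made up, and the point is only that dropna(thresh=N) keeps rows with at least N non-missing answers.

```python
# Toy sketch of the thresh filter: dropna(thresh=N) keeps only rows that have
# at least N NON-missing answers (the toy matrix below is made up).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
matrix = pd.DataFrame(rng.choice([-1.0, 0.0, 1.0, np.nan], size=(1000, 84),
                                 p=[0.05, 0.05, 0.05, 0.85]))

kept3 = matrix.dropna(thresh=3)   # keep rows with >= 3 answered questions
kept8 = matrix.dropna(thresh=8)   # stricter: >= 8 answered questions
print(len(matrix), len(kept3), len(kept8))

# before UMAP, the remaining missing values were filled, e.g. with column means
filled = kept8.fillna(kept8.mean())
```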
- Maybe it's like this:
- The process of filling in missing values with averages has created a very strong cluster of âpeople who hardly vote at allâ.
- Those who actually voted on many opinions are scattered in the high-dimensional space, forming a thin mass with no clear boundaries
- UMAP estimates density from the distance to the k nearest neighbors; this interacts badly with the characteristics of agree/disagree data, where values are discrete and many points overlap at exactly the same coordinates, so such regions are recognized as "denser clusters than they really are" and UMAP tries to isolate and visualize them as distinctive clusters
- I have not verified this thoroughly, but once I concluded that discrete-valued agree/disagree vectors are not well suited to UMAP, I put this line of investigation on hold.
- When I discarded everyone with missing values and ran UMAP on the data for a single map only, rather than the 84 dimensions of all maps combined, the result was much easier to interpret.
- This also supports the hypothesis raised for the negative observation in Problem with two Polis clusters: "the lack of cluster separation is not a phenomenon caused by an increase in the number of voting users, but by the opinion space becoming higher-dimensional as users contribute new opinions".
New projects (to be announced 11/22)
- This is an analysis of embedding vectors of natural-language text
- I found the characteristics of the data distribution to be important, so I examined them with similar data.
- Distribution of the text embedding vectors
- 7574 arguments extracted by Talk to the City's extraction step from the 18129 posts retrieved from X
- Embedded with text-embedding-3-large, 3072 dimensions
- A single mass of nearly uniform distances, with some darker (denser) areas in places
- Conceptual drawing: One lump with small grains.
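One way to check this "single lump of nearly uniform distances" impression is to look at the distribution of pairwise cosine distances on a sample. A minimal sketch, assuming the embeddings are available as a NumPy array (the file name is hypothetical):

```python
# Sketch: check whether the embedding vectors form "one lump of nearly uniform
# distances" by looking at pairwise cosine distances on a random sample.
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

embeddings = np.load("embeddings.npy")   # hypothetical file, e.g. (7574, 3072)

rng = np.random.default_rng(0)
idx = rng.choice(len(embeddings), size=min(1000, len(embeddings)), replace=False)
d = cosine_distances(embeddings[idx])
d = d[np.triu_indices_from(d, k=1)]      # upper triangle, excluding the diagonal
print(f"mean={d.mean():.3f} std={d.std():.3f} min={d.min():.3f} max={d.max():.3f}")
# a small std relative to the mean matches the "single lump with small grains" picture
```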
- Clustering and Partitioning
- The "where do we draw the line" question I mentioned earlier
- If you have a density distribution with nothing between the density peaks, like the one on the left, you don't have to worry about how to cluster
- Real-world data, as shown on the right, has no clear boundary because there is also data between the density peaks
- Traditional "clustering" such as k-means attempts to "split" the whole data set
- That "division" (partition) is what gets called a cluster
- Even data like X, which is quite far from the peak of A and separated from it only by a thin low-density zone, would be counted as part of "cluster A"
- The DBSCAN authors saw this as a problem.
- What is a "cluster" anyway?
- The concept of "cluster" as redefined by the authors of DBSCAN
- "A cluster is an area where data points are densely distributed."
- "a cluster is defined to be a set of density-connected points which is maximal wrt. density-reachability."
- The idea is that a cluster is just a region where data points exceed a certain density, and X is noise that does not belong to any cluster (a small sketch of this contrast follows this list)
- To distinguish between these two ways of thinking about clustering, they refer to the former as "partitioning"
- Clustering of opinion vectors using Talk to the City is "partitioning"
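A minimal sketch of the contrast between the two definitions, using synthetic data: K-means "partitions" every point, including the isolated X, while DBSCAN keeps only the dense regions and labels X as noise.

```python
# Synthetic contrast: K-means "partitions" every point, including the isolated X,
# while DBSCAN keeps only dense regions and labels X as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=[[0, 0], [4, 0]], cluster_std=0.5, random_state=1)
X = np.vstack([X, [[2.0, 3.0]]])   # one isolated point "X", far from both density peaks

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
db = DBSCAN(eps=0.4, min_samples=5).fit_predict(X)

print("K-means label of the isolated point:", km[-1])   # forced into one of the partitions
print("DBSCAN label of the isolated point:", db[-1])    # -1 = noise, belongs to no cluster
```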
- Differences between SpectralClustering and HDBSCAN
- Spectral Clustering is used by Talk to the City for visualization purposes.
- HDBSCAN is "Hierarchical DBSCAN"
- While watching the discussion in Nittele NEWS x 2024 House of Representatives Election x Broad Listening, I was thinking, "Wouldn't it be better to cut out the high-density parts and discard the rest as low-density data, so that downstream steps such as AI-generated cluster commentary would work better?"
- Then I learned about DBSCAN and related ideas and thought, "Oh, that's the same thing!"
- Now that I understood DBSCAN/HDBSCAN a little better, I decided to tweak the Talk to the City algorithm along those lines.
- Experiment log: pTTTC2024-11-12.
- 7574 arguments extracted by Talk to the City's extraction step from the 18129 posts retrieved from X
- Embedded with text-embedding-3-large, 3072 dimensions
- From here, I extracted the dense regions with HDBSCAN(min_cluster_size=5, max_cluster_size=30, min_samples=2) and obtained 79 clusters (a minimal sketch of this step appears at the end of this section)
- This setting is intended to pick up a lot of fine clusters
- Visually inspecting them, the impression was "this is certainly dense content"
- 739 data points were used in the clusters, a coverage of 9.8%
- In other words, about 90% is judged to be noise
- Noise is sometimes also called "outliers", but the word "outlier" suggests a relatively small fraction
- This approach has advantages and disadvantages compared to the current TTTC
- The current TTTC, which shows the whole picture with a scatter plot and forced clustering, is well suited to the desire of "let's start with the broad strokes... I want to see the whole picture..."
- During the gubernatorial campaign, the usage was roughly "just show it as-is to people across the net" OR "the policy team reads it and refers to it"
- Some policy team members said that another method, Talk to the City Turbo, was more in-depth and more helpful (N=1)
- Nittele NEWS x 2024 House of Representatives Election x Broad Listening needed to provide "commentary" for viewers
- That process made clear the need not only to present the big picture, but also to "dig deeper into individual opinions"
- For this, the current TTTC can only show N=1 examples by "hovering the mouse over a point"
- There is also little support for "finding which opinion to focus on"
- You either look at a few at random and leave it to luck,
- or you end up reading all 7,000 cases
- The new method we have developed will improve this area.
- It reduces the burden: "look at the 79 dense clusters instead of reading all 7,000 cases"
- It increases persuasiveness: opinions can be presented as "N=5-30 instead of N=1"
- We applied this algorithm to the data in TTTC: Public Comments on AI and Copyright, which can be made publicly available.
- https://github.com/nishio/ai_kj/blob/main/dense_cluster_extruction.ipynb
- 5957 cases, 1536 dimensions
- 48 clusters were extracted
- I had GPT-4o score how interesting each cluster was, but that in itself did not yield very interesting results!
- Well, this may be partly because we haven't told the AI what the purpose of the research is and what kinds of things count as interesting
- It might be more interesting to feed the list of clusters back in with a prompt like "list the clusters that introduce a unique perspective, and explain what is original about them"
- ChatGPT log
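A minimal sketch of the dense-cluster extraction step referenced above. The HDBSCAN parameters are the ones quoted earlier, but the file name, the normalization step, and the use of the hdbscan package are assumptions rather than the actual notebook.

```python
# Sketch of the dense-cluster extraction; the HDBSCAN parameters are the ones
# quoted above, but the file name and normalization step are assumptions.
import numpy as np
import hdbscan

embeddings = np.load("embeddings.npy")                                 # e.g. (7574, 3072)
norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # cosine ~ Euclidean on unit vectors

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, max_cluster_size=30, min_samples=2)
labels = clusterer.fit_predict(norm)

n_clusters = labels.max() + 1
covered = int((labels != -1).sum())
print(f"{n_clusters} dense clusters, {covered} points covered ({covered / len(labels):.1%}); the rest is noise")
```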
Relation to the KJ method
- In the June Talk to the City Study Group, I also wrote about the connection with the KJ method in the appendix.
- way of thinking p.77
- "In his mind he has a dogmatic principle of grouping, such as 'In my opinion, it is correct to divide these paper slips into three major categories: market research, quality control, and labor management.' He is simply applying this dogmatic classification scheme, sifting the paper slips and fitting them into a ready-made framework, which completely defeats the conceptual significance of the KJ method."
- Jiro Kawakita criticized people for this kind of sieving and fitting into a dogmatic classification framework
- As a way to deal with this problem, he proposed the KJ method: look at each piece one by one and place it next to the related pieces
- This is Agglomerative Hierarchical Clustering in data science terms.
- TTTC avoids "dogmatic human fitting" by having machines, rather than humans, perform the clustering.
- This time, the relationship with the KJ method has become even clearer.
- Clustering in the sense of "partitioning" splits the whole data set top-down
- Extracting dense clusters with a method like DBSCAN is almost the same as "collecting related things to form a group" in the KJ method
- Leaving things that do not fit well in any cluster out of the clusters corresponds to the "lone wolf" (äž匹狌) of the KJ method
- Having 90% of the data end up as lone wolves is not how the KJ method works, though
- This happens because this method defines "proximity" as "distance in the embedding vector space"
- When humans do the KJ method, they are asked to find connections between things that do not seem related and group them, "discovering relationships" such as "X and Y do not seem connected at first glance, but they are related in aspect A"
- So this means finding another kind of "nearness" by performing an additional operation on things that are "far" in the sense of "cosine distance in the embedding vector space"
- I think the reason the KJ method, as a "way of thinking" or "intellectual production technique", promotes the generation of ideas is precisely that it forces people to do this "search for relationships"
- How can a machine search for relationships that are not expressed in the distance of the embedding vector?
- In my book "Natural Language Processing with word2vec", I wrote that [Similarity of concepts is not distance]
- In this case, it may be possible to deal with this by randomly collapsing axes, by collapsing the axes with a low PCA contribution rate, or, reversing the idea, by collapsing the axes with the highest contribution rate (a future experiment; a small sketch appears after this list)
- If the data fits into the LLM context, you could also let an LLM "discover relationships" just as humans do
- I think it is fine to chop the data down to an amount that fits into the LLM context
- After all, there is also a limit to the number of labels that fit into a human's field of view
- The KJ method never throws away even a single lone wolf
- To achieve that, we need to figure out how to merge the lone wolves into some cluster
- On a purely distance basis it would probably "run, but not give good results"
- Need to experiment to see if I can prompt the LLM to do it.
- The KJ method is a technique that requires a great deal of skill to begin with; Jiro Kawakita, the world's most experienced practitioner of the KJ method, assumed "10 hours for a KJ round with 50 labels" and completed only 129 KJ charts in 5 years because it takes so much time. I feel the time is approaching when AI will make it an easy method.
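A small sketch of the "collapse axes" idea mentioned above (a future experiment, not something already implemented): keep only the top principal axes and look for pairs that were "far" in the full space but become "near" afterwards. The file name, sample size, and number of axes kept are assumptions.

```python
# Sketch of the "collapse axes" idea above (a future experiment, not an implemented
# feature): keep only the top principal axes and look for pairs that were "far" in
# the full space but become "near" afterwards. File name and numbers are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_distances

embeddings = np.load("embeddings.npy")                       # hypothetical embedded opinions
rng = np.random.default_rng(0)
sample = embeddings[rng.choice(len(embeddings), size=min(2000, len(embeddings)), replace=False)]

full_d = cosine_distances(sample)

n_keep = 50                                                  # assumed number of axes to keep
reduced = PCA(n_components=n_keep).fit_transform(sample)
reduced_d = cosine_distances(reduced)

far_before = full_d > np.quantile(full_d, 0.9)               # "far" in the full space
near_after = reduced_d < np.quantile(reduced_d, 0.1)         # "near" after collapsing minor axes
i, j = np.where(np.triu(far_before & near_after, k=1))
print(f"{len(i)} candidate pairs whose hidden relationship could be handed to an LLM")
```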
This page is auto-translated from /nishio/é«æŹĄć ăăŒăżćæććŒ·äŒ using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.