% pip install langchain_openai
% pip install python-dotenv

Got an error once it reached the embedding step.
% pip install --upgrade langchain
% pip install -U langchain-community
This fixed it.

HDBSCAN on a (7574, 3072) embedding matrix takes 5 minutes.

  • Randomly sampling down to (1000, 3072) brings it to about 5 seconds (see the sketch after this list).
  • The dimension of the embedding vector probably only contributes on a linear order, so I don't think it needs to be reduced.
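A minimal sketch of that sampling step, assuming the embeddings are already loaded as a NumPy array named `embeddings` (the variable name is my placeholder):

```python
import numpy as np

# Randomly sample 1,000 rows from the (7574, 3072) embedding matrix
# so that HDBSCAN finishes in seconds instead of minutes.
rng = np.random.default_rng(0)
idx = rng.choice(len(embeddings), size=1000, replace=False)
sample = embeddings[idx]  # shape: (1000, 3072)
```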

I wanted to extract about 30-100 clusters. Unlike k-means, HDBSCAN doesn't let you specify the number of clusters in the first place.

  • Making min_cluster_size=5 smaller increases the number of clusters.
    • However, calling two data points a "cluster" feels questionable.
  • Decreasing min_samples=2 also increases the number of clusters.
    • I'm not entirely sure what the principle is.
    • It seems to be related to the density calculation.
  • max_cluster_size=30
    • This will prevent huge clusters from appearing.
    • It doesn't break up huge clusters that have already formed; understanding the behavior here needs a bit more experimentation (the three parameters appear together in the sketch below).
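A non-authoritative sketch of how these parameters are passed with the hdbscan package; the values are just the ones tried above, and `max_cluster_size` needs a reasonably recent hdbscan release:

```python
import hdbscan

# Illustrative settings from the notes above, not a definitive configuration.
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5,    # smaller -> more (and tinier) clusters
    min_samples=2,         # smaller -> more clusters; affects the density estimate
    max_cluster_size=30,   # keeps huge clusters from appearing (does not split existing ones)
)
labels = clusterer.fit_predict(sample)  # `sample` from the sketch above; label -1 = sparse-area noise
```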

With 1000 cases, (min_cluster_size, min_samples) = (5, 2), I got 10 clusters.

  • (2, 2) increases it to 38 clusters, but then "just two points count as a cluster."
    • Isn’t that a bit much?
    • But conversely, everything else is considered sparse area data.
    • When there is very little data, it may be a case of "I have no choice but to try (2, 2)."
      • The nuance is, "It could be a fluke and might be dismissed once more data is gathered, but at this stage it's a likely candidate."

The more data there is, the higher the probability that the density of data exceeds a fixed threshold.

  • 7,000 cases, (5, 2) made 79 clusters.
  • But the problem is that HDBSCAN's cost grows nonlinearly with the amount of data.
    • Running a parameter search on a large amount of data is expensive, but sampling changes the behavior.
    • If it is a random sampling, the inference that “if it is like this when there are 1000 cases, it will be like this when there are 7000 cases” is likely to be valid.
      • Well, doing the naive math, since 7000/1000 = 7, (2, 2) at 1,000 cases corresponds to roughly (14, 14) at 7,000 cases.

Extract "why were these lumped together" for these data, expecting output like:

```json
[
    {
        "explanation": "...",
        "label": "..."
    }
]
```

It's annoying that the response sometimes comes back in a different format than planned.

It should now be possible to specify the JSON schema explicitly in the API call, so I'll look into that.
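One possible approach, as a hedged sketch: langchain_openai's `with_structured_output` with a Pydantic schema mirroring the JSON above. The model name, class names, and `texts_in_cluster` are my placeholders, and the top-level list is wrapped in an `items` field because structured output expects an object:

```python
from typing import List
from pydantic import BaseModel
from langchain_openai import ChatOpenAI

# Hypothetical schema mirroring the JSON format shown above.
class ClusterLabel(BaseModel):
    explanation: str
    label: str

class ClusterLabels(BaseModel):
    items: List[ClusterLabel]

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(ClusterLabels)

# texts_in_cluster: the opinions belonging to one dense cluster (placeholder).
result = structured_llm.invoke(
    "Explain why these opinions were lumped together and give a short label:\n"
    + "\n".join(texts_in_cluster)
)
print(result.items)
```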

2024-11-13
The TTTC for Nittele already includes both Azure support and JSON enforcement for the output; so thoughtful and impressive!

79 clusters were created from 7,574 data points
- 739 data points were used, a coverage of 9.8%.

I got good experimental results, as expected, so I want to do the same thing with data I can publish and write an article about it.
- I'd also like to write up the Yasuno team's case with Nittele.
- I'd like to write an article about the public opinion map.
    - That's three already!

![image](https://gyazo.com/aa54ff46f8ab889e5458cac0172fc9a4/thumb/1000)
- Conventional TTTC tries to classify all the data into clusters
    - So ideally you get clearly separated islands like (1).
    - In (2) and (3), multiple topics got mixed together, with cases where "the AI's explanation and a completely different story are mixed in the same cluster."
- The experiment this time stopped "classifying all the data" and instead cut out only the densely concentrated areas.
    - 79 clusters were created. Since only the dense parts were taken, the resulting clusters are dense.
    - But many opinions were not assigned to any cluster; perhaps more than you'd imagine, about 90% were discarded.
- Neither approach is strictly better than the other; each has advantages and disadvantages and they should be used appropriately.

The current TTTC, which shows the whole picture with scatter plots and forced clustering, is well suited to satisfy the desire [[Let's start with the broad strokes...]] [[I want to get the whole picture...]].
- During the gubernatorial election, the usage was either "just show it as-is to people on the other side of the net" OR "the policy team reads it and refers to it."
    - [[Nittele NEWS x 2024 House of Representatives Election x Broad Listening]] needed to provide "commentary" for viewers
    - That process made clear the need not only to present the big picture, but also to "dig deeper into opinions."
        - At that point, the current TTTC only lets you "hover the mouse over a point to display it," i.e. showing N=1 at a time.
        - Support is scarce in that "finding an opinion to focus on" area.
            - You just look at a few at random and leave it up to luck.
            - You end up looking at all 7,000 cases.
    - This new experiment will improve this area.
        - Reduce the burden of "looking at 79 clusters with high concentration instead of looking at all 7,000 cases."
        - Increased persuasiveness by being able to present as "N=5~30 opinions instead of N=1".

2024-11-14
[[Difference between DBSCAN and HDBSCAN#67352797aff09e00003bcfa4|67352797aff09e00003bcfa4]]
If that's the case, for my use case it might be better to run a lightweight DBSCAN several times with different parameters.
Let's extract about 100 clusters and apply the KJ method to them.
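A rough, non-authoritative sketch of that idea with scikit-learn's DBSCAN; the eps values, min_samples, and the cosine metric are assumptions for illustrating the parameter sweep (and `embeddings` is the placeholder array from the earlier sketches), not settings from the actual experiment:

```python
from sklearn.cluster import DBSCAN

# Sweep eps and see how many clusters each setting yields;
# DBSCAN is cheap enough to rerun several times on the full data.
for eps in (0.2, 0.3, 0.4, 0.5):
    labels = DBSCAN(eps=eps, min_samples=5, metric="cosine").fit_predict(embeddings)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters")
```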

- [[distribution of text-embedded vectors]]

Even if I stick with HDBSCAN, leaf clustering seems better suited to my purpose.
[Parameter Selection for HDBSCAN* — hdbscan 0.8.1 documentation](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html#leaf-clustering)
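A minimal sketch of leaf clustering with the hdbscan package, per the linked documentation; the other parameter values are carried over from the earlier experiments and are only illustrative:

```python
import hdbscan

# "leaf" selects the fine-grained leaf clusters of the condensed tree
# instead of the default "eom" selection, giving many smaller,
# more homogeneous clusters.
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5,
    min_samples=2,
    cluster_selection_method="leaf",
)
labels = clusterer.fit_predict(embeddings)
```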

---
Special Feature Sort ✅

I re-ran the dense-cluster-extraction mechanism on another TTTC dataset (AI PubCom), and it worked without trouble.
- [https://github.com/nishio/ai_kj/blob/main/dense_cluster_extruction.ipynb](https://github.com/nishio/ai_kj/blob/main/dense_cluster_extruction.ipynb)
- [[Extraction of dense masses]]

The AI's take on "interesting and dense clusters"
- Personally, I find the lower-scored ones more interesting and less familiar.
- Well, that may be partly because we haven't told the AI what the purpose of the research is and what kinds of things count as interesting.
Rather than having it rate the degree of interest for each cluster individually, would it be better to have the AI read and critique the list after it is made?
The prompt "a perspective like no other" sounds good, interesting.

<img src='https://scrapbox.io/api/pages/nishio/gpt/icon' alt='gpt.icon' height="19.5"/>Clusters that introduce a unique perspective and their originality
- C16: Illegal Content Risk of AI Training Data
    - It is unique in its perspective on ethical concerns in AI learning, with a special focus on the presence of illegal content, particularly the possible inclusion of child pornography. While other clusters focus on copyright and unauthorized use, this one takes a step further into the ethical boundaries of AI technology and expands its perspective to legal risks.
- C12: Concerns about misuse of AI technology
    - It is unique in that it refers to risks such as misuse for commercial purposes or fraudulent activities, as well as economic impact that threatens the lives of others, and introduces the perspective that the use of AI goes beyond infringement of creators' rights and relates to the safety of society as a whole.
- C4: Importance of Transparency and Accountability
    - It differs from the other clusters in its emphasis on transparency regarding the development and training process of AI systems and its call for specific measures at the intersection of technology and ethics.

These clusters offer creative perspectives on the application of AI technology in ethics, legal risks, and transparency.

[https://chatgpt.com/share/6735c2e5-5bac-8011-bccc-d8ffd64fd1d9](https://chatgpt.com/share/6735c2e5-5bac-8011-bccc-d8ffd64fd1d9)

---
requirements.txt
- [https://gist.github.com/nishio/756cd4c6dee7203cc6a32172ce1e2f78](https://gist.github.com/nishio/756cd4c6dee7203cc6a32172ce1e2f78)

Deprecation warnings when running it:

```
scatter/pipeline/steps/embedding.py:3: LangChainDeprecationWarning: Importing OpenAIEmbeddings from langchain.embeddings is deprecated. Please replace deprecated imports:

from langchain.embeddings import OpenAIEmbeddings

with new imports of:

from langchain_community.embeddings import OpenAIEmbeddings

You can use the langchain cli to automatically upgrade many imports. Please see documentation here https://python.langchain.com/docs/versions/v0_2/
  from langchain.embeddings import OpenAIEmbeddings

/Users/nishio/shugiinsenyo2024-tttc/venv/lib/python3.10/site-packages/langchain/chat_models/__init__.py:33: LangChainDeprecationWarning: Importing chat models from langchain is deprecated. Importing from langchain will no longer be supported as of langchain==0.2.0. Please import from langchain-community instead:

from langchain_community.chat_models import ChatOpenAI

To install langchain-community run `pip install -U langchain-community`.
  warnings.warn(

venv/lib/python3.10/site-packages/langchain/chat_models/__init__.py:33: LangChainDeprecationWarning: Importing chat models from langchain is deprecated. Importing from langchain will no longer be supported as of langchain==0.2.0. Please import from langchain-community instead:

from langchain_community.chat_models import ChatOpenAI

To install langchain-community run `pip install -U langchain-community`.
  warnings.warn(
```



I'm not going to worry about it for now.
- I was fiddling with it and things got complicated.

Just a note on the "complications":

```
from huggingface_hub import HfApi, HfFolder, Repository, hf_hub_url, cached_download
ImportError: cannot import name 'cached_download' from 'huggingface_hub' (/Users/nishio/shugiinsenyo2024-tttc/venv/lib/python3.10/site-packages/huggingface_hub/__init__.py)
```


<img src='https://scrapbox.io/api/pages/nishio/gpt/icon' alt='gpt.icon' height="19.5"/>This error may be caused by a newer version of the Hugging Face Hub library. cached_download function has been removed or deprecated in recent versions and an alternative method is recommended.
- After this, the AI recommends `pip install huggingface_hub==0.12.0`, but the last version that worked properly was `huggingface-hub==0.25.1`, so that is rolling back far too many versions.
- Well, this is an area where the AI's information may be out of date, or it may be hallucinating.

Then another error:

```
ImportError: cannot import name 'HF_HUB_DISABLE_TELEMETRY' from 'huggingface_hub.constants' (/Users/nishio/shugiinsenyo2024-tttc/venv/lib/python3.10/site-packages/huggingface_hub/constants.py)
```



---
This page is auto-translated from [/nishio/pTTTC2024-11-12](https://scrapbox.io/nishio/pTTTC2024-11-12) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at [@nishio_en](https://twitter.com/nishio_en). I'm very happy to spread my thoughts to non-Japanese readers.