Public Comment on AI and Copyright, collected by the Agency for Cultural Affairs, analyzed and visualized with AI (Talk to the City): https://nishio.github.io/tttc_aipubcom_japan/ (you can switch the report to Japanese with the flag icon in its upper right corner).
experimental process
Talk to the City: [/tkgshn/report on Taiwan's same-sex marriage debate at Talk to the City](https://scrapbox.io/tkgshn/report on Taiwan's same-sex marriage debate at Talk to the City).
2024-05-27
- I was curious about the clustering algorithm Talk to the City uses (from Governor's race x digital democracy, 2024-05-26), so I looked it up.
- https://github.com/AIObjectives/talk-to-the-city-reports.git
- tkgshn had a hard time using the visual tool "Talk to the City Turbo", but there is also a CLI tool, "Talk to the City Reports", so let's take a look at that.
- The pipeline is a single track: extraction → embedding → clustering → labelling → takeaways → overview → translation → aggregation → visualization.
Considering which data to test: the Public Comment on AI and Copyright. It's rather messy…
First, as a separate exercise, I will do a Polis-like analysis of the joint research by the Taniguchi Laboratory at the University of Tokyo and the Asahi Shimbun (UTAS).
- Both Polis-like data and free-text answers are available.
- The free-text answers are fewer.
- First, I did the Polis-like analysis → pPolis2024-05-27.
2024-05-30 Extracting natural sentences from UTAS data
- https://gist.github.com/nishio/de629941b5ac4a080fae88348df1499c
- Comments from 126 politicians on constitutional amendments
- Lots of overlap.
- 85 remain after excluding duplicates.
- There are more comments about the House of Councillors.
Let's try something small for now. Small start: following the README, create a CSV with comment-id and comment-body columns plus a config, then run it.
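For the record, a minimal sketch of the input and config. The exact schema is in the repo's README; the file names here are my own, and I'm inferring the config fields from the parameters that show up in the logs below (input, extraction limit, clusters, languages, NAME, QUESTION).
inputs/utas.csv
comment-id,comment-body
1,Public input should be reflected through the constitution-making process.
2,Single-year budget system to be eliminated.
configs/utas.json
{
  "name": "UTAS test",
  "question": "What do politicians say about constitutional amendments?",
  "input": "utas",
  "extraction": {"limit": 12},
  "clustering": {"clusters": 3},
  "translation": {"languages": ["English"]}
}
Then run it with:
% python main.py configs/utas.json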
extraction
Running step: extraction
50%|██████████████████████████████████████████████████████ | 2/4 [00:05<00:05, 2.92s/it]JSON error: Expecting value: line 1 column 1 (char 0)
Input was: Severe punishment for foreign powers proceeding
Response was: I'm sorry, but I can only assist in English. If you provide the text in English, I'll be happy to help you refine it.
Retrying...
75%|█████████████████████████████████████████████████████████████████████████████████ | 3/4 [00:07<00:02, 2.51s/it]JSON error: Expecting value: line 1 column 1 (char 0)
Input was: Governing structure reform
Response was: I'm sorry, but it seems like the text "governance reform" is in a language that I am not able to understand. I'm sorry, but it seems like the text "governance reform" is in a language that I am not able to understand.
Retrying...
JSON error: Expecting value: line 1 column 1 (char 0)
Input was: Governing structure reform
Response was: I'm sorry, but it seems like the text "governance reform" is in a language other than English, and I am unable to provide a concise and clear argument based If you could provide a translation or more context, I would be happy to assist further.
Retrying...
JSON error: Expecting value: line 1 column 1 (char 0)
Input was: Governing structure reform
Response was: I'm sorry, but it seems like the text you provided is not related to the topic of artificial intelligence. Could you please provide an argument related to AI for me to help you refine?
Retrying...
JSON error: Expecting value: line 1 column 1 (char 0)
Input was: Governing structure reform
Response was: I'm sorry, but it seems like the text "governance reform" is in a language other than English. Could you please provide a translation or more context so I can assist you better? can assist you better?
Silently giving up on trying to generate valid list.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:12<00:00, 3.23s/it]
I'm at a loss over how it interprets these from Japanese. If it doesn't work after a few retries, it gives up: "Silently giving up on trying to generate valid list." However, the other comments are also in Japanese and were interpreted without trouble, so I guess these inputs are too short or too ambiguous for the meaning to be extracted.
- 🤖 "I'm not sure what you mean by 'governance reform'…"
embedding
Running step: embedding
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.26s/it]
clustering
Running step: clustering
Traceback (most recent call last):
File "/Users/nishio/talk-to-the-city-reports/scatter/venv/lib/python3.12/site-packages/nltk/corpus/util.py", line 84, in __load
root = nltk.data.find(f"{self.subdir}/{zip_name}")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/nishio/talk-to-the-city-reports/scatter/venv/lib/python3.12/site-packages/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('stopwords')
No, no, no. Looking at clustering.py:
stop = stopwords.words("english")
vectorizer_model = CountVectorizer(stop_words=stop)
It's hard-coded to assume the language is English. Well, okay, let's just run through to the end and get the big picture first.
Running step: clustering
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
2024-05-30 14:43:23,417 - BERTopic - Reduced dimensionality
2024-05-30 14:43:23,423 - BERTopic - Clustered reduced embeddings
labelling, takeaways, overview
Running step: labelling
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.20it/s]
Running step: takeaways
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.67s/it]
Running step: overview
translation
Running step: translation
Translating to Mandarin...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:16<00:00, 2.38s/it]
Translating to English...
0%| | 0/7 [00:00<?, ?it/s]JSON error: Expecting ',' delimiter: line 1 column 216 (char 215)
Response was: ["Through public debate, considering Japan's history and education, it is necessary to create a democratic law by taking time into account.", "It is important to reflect the public opinion by creating a process of "constitutional amendment" through consensus.", "It is necessary to stop the annual budget system.", "For the protection of human rights, it is essential to have discussions and collaborations on the internet.", "Through public debate, considering Japan's history and education, it is necessary to create a democratic law by taking time into account.", "By passing the process of "constitutional amendment", reflecting the public opinion is important.", "For Japan to consider Japan's future, it is necessary to establish a law that will become the basis.", "By revising the 41st article of the constitution, implementing the national referendum, and reflecting its contents in politics is important.", "Keep an eye on the "constitutional amendment".", "Revise the 41st law, implement the national referendum, and reflect its contents in politics."]
Retrying batch...
Warning: batch size mismatch!
Batch len: 10
Response len: 9
Batch item 0: Through public debate, an independent constitution that takes into account Japanese traditions and customs should be created over time.
Response: By engaging in public debate, considering Japan's history and education, creating necessary self-defense laws over time is essential
Batch item 1: Public input should be reflected through the "constitution-making" process
Response: By promoting the principles of 'innovation', reflecting the opinions of the people to be considered
Batch item 2: Single-year budget system to be eliminated
Response: It is necessary to stop the annual budget system
Batch item 3: Internet Initiatives Needed to Protect Human Rights
Response: For the protection of human rights, it is essential to have an internet platform for gathering information
Batch item 4: Take time to create an independent constitution that takes into account Japanese traditions and customs through public debate.
Response: By engaging in public debate, considering Japan's history and education, creating necessary self-defense laws over time is essential
Batch item 5: Public opinion should be reflected through the "constitution-making" process.
Response: Establishing laws to become a better future for Japan that the Japanese people consider
Batch item 6: A constitution should be created that the Japanese people believe will benefit Japan's future.
Response: By revisiting discussions and debates, developing Japan's unique education and history, creating necessary self-defense laws over time
Batch item 7: Take time to create an independent constitution that takes into account Japan's unique customs and traditions through discussions with the public.
Response: Keep the focus on 'innovation'
Batch item 8: "Constitutional Creation".
Response: Amend the 41st law, implement national referendums on important legislation, and reflect its contents in politics
Batch item 9: Article 41 of the Constitution should be amended to allow for referendums on important bills and their content should be reflected in the political process.
Retrying with smaller batches...
It's trying to translate everything in one batch and failing.
- I had to start it over a few times.
- In the end, the translation appears to have completed.
visualization
Running step: aggregation
Running step: visualization
> next-app@0.1.0 build
> next build
Static HTML is output
I should have set the title in the config or somewhere.
Overview: The public consultation on constitutional development in Japan revealed three main clusters of opinions. Cluster 0 emphasized the importance of crafting a unique constitution reflecting Japanese traditions, while Cluster 1 focused on constitutional reform through public engagement and utilizing the internet for human rights protection. Cluster 2 highlighted the need for a new constitution with amendments to include national referendums and strengthen the role of the National Diet, along with calls for strict penalties against foreign interventions. (The same overview was also shown in Japanese.)
This is what happens when you hover. And this is what happens when you click.
Can we put this in this cluster?
Translation failure.
Representative comments: Through public debate, we should take the time to create an independent constitution that takes into account Japanese traditions and customs. We should take time to create an independent constitution that takes into account Japanese traditions and customs through public debate. Take time to create an independent constitution that takes into account Japanβs unique customs and traditions through discussions with the people. Through public debate, it is important to take the time to create an independent constitution that takes into account Japanese traditions and customs. Single-year budget system to be eliminated.
- Can we put this in this cluster?
Appendix
This is useful.
- I wish there was some progress data along the way.
AI Cost
Amount of data
- Japanese: 2,450 characters
Let's try a few more things before tackling larger data sets. Revisiting config.json.
The title should indeed have been set in the config.
- NAME and QUESTION are the relevant fields.
- The number of clusters is fixed at three.
- As for the translation part, I had set it up intending to process in Japanese and then translate into English.
- But since the clustering code assumes English, it makes more sense to process in English and translate into Japanese.
- To process Japanese as-is, the clustering step's CountVectorizer would need Japanese morphological analysis and Japanese stop words (see the sketch after this list).
- Add "All texts should be in English." to the extraction prompt and proceed with the policy of extracting in English.
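For reference, a minimal sketch of what a Japanese-aware CountVectorizer in clustering.py could look like. This is my own illustration, assuming the fugashi MeCab wrapper is installed; the stop-word list is just a tiny sample (a real one, e.g. SlothLib's, would be much longer).
clustering.py (hypothetical Japanese variant)
from fugashi import Tagger
from sklearn.feature_extraction.text import CountVectorizer

tagger = Tagger()

def tokenize_ja(text):
    # Split Japanese text into surface forms with MeCab.
    return [word.surface for word in tagger(text)]

# Tiny illustrative stop-word sample, not a complete list.
japanese_stopwords = ["の", "に", "は", "を", "た", "が", "で", "て", "と"]

vectorizer_model = CountVectorizer(tokenizer=tokenize_ja, stop_words=japanese_stopwords)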
Re-execute. "Severe punishment for the progress of foreign powers" and "reform of the governing structure" still got "I don't know" responses. Oh, and I forgot to add Japanese to the translation targets.
So, here is what I am planning to run:
{'step': 'extraction', 'run': False, 'reason': 'nothing changed'}
{'step': 'embedding', 'run': False, 'reason': 'nothing changed'}
{'step': 'clustering', 'run': False, 'reason': 'nothing changed'}
{'step': 'labelling', 'run': False, 'reason': 'nothing changed'}
{'step': 'takeaways', 'run': False, 'reason': 'nothing changed'}
{'step': 'overview', 'run': False, 'reason': 'nothing changed'}
{'step': 'translation', 'run': True, 'reason': 'some parameters changed: languages'}
{'step': 'aggregation', 'run': True, 'reason': 'some dependent steps will re-run: translation'}
{'step': 'visualization', 'run': True, 'reason': 'some dependent steps will re-run: aggregation'}
Looks good? Press enter to continue or Ctrl+C to abort.
It will just redo the second half.
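Judging from this plan output and the step dict printed later (e.g. 'dependencies': {'params': ['clusters'], 'steps': ['embedding']}), the planner presumably works roughly like this sketch of mine, not the repo's actual code: a step re-runs if one of its parameters changed or if any step it depends on will re-run.
def plan(steps, previous_params, current_params):
    # steps are in pipeline order
    will_run = set()
    for step in steps:
        deps = step['dependencies']
        changed = [p for p in deps['params']
                   if previous_params.get(p) != current_params.get(p)]
        if changed or any(s in will_run for s in deps['steps']):
            will_run.add(step['step'])
    return will_run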
Translation from English to Japanese was batch-processed without any problems. Hmm, it doesn't know the term "constitutional," so there's a mistranslation.
prompt: All texts should be in English. Translate Japanese words based on their meaning, not just their sound.
Error in embedding
TypeError: expected string or buffer
- That's the one.
- FIX: LLM sometimes returns string and the code handle it as list by nishio · Pull Request #186 · AIObjectives/talk-to-the-city-reports · GitHub
Should have named the project.
Public Comment on AI and Copyright
534,365 characters. I think it would cost only a few hundred yen even run in full, but let's try it with 1/10 of the data first.
extraction
Input was: "A Proposed Guidelines for Business Operators", Attachment 1, Part 1, related/P13, line 20, "Risks due to AI
Response was: I'm sorry, but it seems like the text you provided is in Japanese and it's not clear how it relates to the task of refining arguments in English. Could you please provide an argument or statement in English that you would like me to work on?
- I wonder if this refusal is a false positive.
Surprisingly cheap
openai.error.InvalidRequestError: This model's maximum context length is 16385 tokens. However, your messages resulted in 17461 tokens. Please reduce the length of the messages.
This is a problem.
After various experiments, it's time to make the final version.
Including everything. Hmm, with corporate comments included, they are nearly all that shows up.
- Are the individual comments too thin?
- I thought mixing them would be more interesting…
Let's limit it to the 2,013 comments from individuals.
"clusters": 7
Hmmm, not much has changed.
Ah, "extraction" "limit": 12
this.
- Letβs change it to 24.
Increase.
But then I get: Warning: data for page "/" is 170 kB which exceeds the threshold of 128 kB, this amount of data can reduce performance.
I'll change it further to 48.
File "/Users/nishio/talk-to-the-city-reports/scatter/pipeline/steps/translation.py", line 102, in translate_batch
parsed = [a.strip() for a in json.loads(response)]
json.decoder.JSONDecodeError: Invalid \uXXXX escape: line 1 column 1098 (char 1097)
Hmmm, this is tricky.
I redid it and it worked.
- I guess the retry logic isn't implemented well; unconfirmed.
It's nice. After all, without this many dots it doesn't feel like "a visualization of a distribution."
- Though the cluster in the lower right corner feels a bit too far away.
If the number of points increases as the extraction limit is raised, then the earlier phenomenon where "only organizational opinions get picked up when organizational data is mixed in" may simply mean the limit was filled by the organizational opinions alone.
So that's why both the 2,450-character UTAS data and the 534,365-character AI public comment data cost only a few cents: they both hit the limit. As an experiment, let's make the limit bigger: increase it from 48 to 10,000. The cost should be about $7.
I let it run and read the code in the meantime. extraction.py
comments = pd.read_csv(f"inputs/{config['input']}.csv")
limit = config['extraction']['limit']
comment_ids = (comments['comment-id'].values)[:limit]
Ah, I see, that's how it works: it just takes the first `limit` comment IDs. So I fed in the 2,013 comments, but it was abruptly cut down to 48!
We came this far and then it died:
99%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 666/671 [29:36<00:13, 2.67s/it]
Traceback (most recent call last):
openai.error.InvalidRequestError: This model's maximum context length is 16385 tokens. However, your messages resulted in 19680 tokens. Please reduce the length of the messages.
I wonder if the format changed for the data after 185001345000002001, or if records are being merged during CSV extraction. I had to hand-edit that part, cutting it down to 1,997 cases, to get it to work. But, technically, such oversize inputs could be checked before actually hitting the API, and the oversize error message could be made a little easier to debug.
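Such a pre-flight check could be as simple as this sketch of mine (assumes tiktoken is installed; the input file name is my guess, and some headroom is left for the prompt itself):
import pandas as pd
import tiktoken

MAX_TOKENS = 16385  # the model's context limit, per the error above
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

comments = pd.read_csv("inputs/aipubcom.csv")
for _, row in comments.iterrows():
    n = len(enc.encode(str(row["comment-body"])))
    if n > MAX_TOKENS - 2000:  # leave headroom for the prompt
        print(f"comment {row['comment-id']} is too long: {n} tokens")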
We're making progress:
100%|██████████████████████████████████████| 666/666 [30:12<00:00, 2.72s/it]
Running step: embedding
100%|██████████████████████████████████████| 6/6 [00:15<00:00, 2.65s/it]
Running step: clustering
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
2024-05-31 04:37:45,187 - BERTopic - Reduced dimensionality
2024-05-31 04:37:45,284 - BERTopic - Clustered reduced embeddings
Running step: labelling
100%|██████████████████████████████████████| 7/7 [00:04<00:00, 1.74it/s]
Running step: takeaways
100%|██████████████████████████████████████| 7/7 [00:16<00:00, 2.35s/it]
Running step: overview
Back to whether the format changed for the data after 185001345000002001, or whether records were merged during CSV extraction. Conclusion: from the third PDF onward, the "receipt number" field is sometimes preceded by a space and sometimes not.
- Some records got merged because the delimiting condition was that lines begin with "receipt number".
- Fixed; there are now 2,985 cases.
93%|███████████████████████████████████████ | 556/601 [1:33:50<07:35, 10.13s/it]
Traceback (most recent call last):
File "/Users/nishio/talk-to-the-city-reports/scatter/pipeline/steps/translation.py", line 102, in translate_batch
parsed = [a.strip() for a in json.loads(response)]
^^^^^^^^^^^^^^^^^^^^
...
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 382 (char 381)
No, no, no, seriously?
Same cause as this earlier one:
- json.decoder.JSONDecodeError: Invalid \uXXXX escape: line 1 column 1098 (char 1097)
- Hmmm, this is tricky.
- I redid it and it worked.
- I guess the retries aren't implemented well enough; unconfirmed.
I'm not keen to redo an hour-and-a-half run just because of a flaky failure…
Let's turn off the translation for now and try to get it to finish.
For example, if we do this and then say, "Let's try a larger number of clusters," will the translation be redone?
- I assume the cluster summary texts, for example, get regenerated and need retranslation, but the extraction results don't, right?
- I'd expect those to be cached and reused, but I wonder if there is a mechanism for that (a sketch of what I mean below).
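What I have in mind is per-item caching keyed by the source text, something like this sketch (my own idea, not the repo's code; translate_batch stands in for the existing API call):
import hashlib, json, os

CACHE_PATH = "outputs/translation_cache.json"

def cache_key(text, lang):
    return hashlib.sha256(f"{lang}:{text}".encode()).hexdigest()

def translate_with_cache(texts, lang, translate_batch):
    cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
    missing = [t for t in texts if cache_key(t, lang) not in cache]
    if missing:
        # Only the texts not seen before go to the API.
        for text, result in zip(missing, translate_batch(missing, lang)):
            cache[cache_key(text, lang)] = result
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f, ensure_ascii=False)
    return [cache[cache_key(t, lang)] for t in texts]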
I was thinking, "The translation will be finished in 5 minutes, so I'll deploy and go to bed when it's done," when the Chinese translation phase started (a little over an hour to go).
2024-05-31 It's done.
There was a lot of trouble and redoing, but the cost is still within the expected range.
Now, if we do something like "let's try more clusters," will the translation be redone?
% python main.py configs/aipubcom.json
(!) {'step': 'clustering', 'filename': 'clusters.csv', 'dependencies': {'params': ['clusters'], 'steps': ['embedding']}, 'options': {'clusters': 8}} step parameter 'clusters' changed from '7' to '50'
diff_params ['clusters']
So, here is what I am planning to run:
{'step': 'extraction', 'run': False, 'reason': 'nothing changed'}
{'step': 'embedding', 'run': False, 'reason': 'nothing changed'}
{'step': 'clustering', 'run': True, 'reason': 'some parameters changed: clusters'}
{'step': 'labelling', 'run': True, 'reason': 'some dependent steps will re-run: clustering'}
{'step': 'takeaways', 'run': True, 'reason': 'some dependent steps will re-run: clustering'}
{'step': 'overview', 'run': True, 'reason': 'some dependent steps will re-run: labelling, takeaways'}
{'step': 'translation', 'run': True, 'reason': 'some dependent steps will re-run: labelling, takeaways, overview'}
{'step': 'aggregation', 'run': True, 'reason': 'some dependent steps will re-run: clustering, labelling, takeaways, overview, translation'}
{'step': 'visualization', 'run': True, 'reason': 'some dependent steps will re-run: aggregation'}
Looks good? Press enter to continue or Ctrl+C to abort.
It looks like only the changed parts should need retranslation.
100%|████████████████████████████████████████████| 606/606 [2:31:52<00:00, 15.04s/it]
Translating to Taiwan...
9%|█████████ | 53/606 [08:00<1:13:18, 7.95s/it]
Wait, it looks like it spent 2.5 hours re-translating everything?
100%|████████████████████████████████████████████████████| 605/606 [1:38:47<00:09, 9.80s/it]
Traceback (most recent call last):
File "/Users/nishio/talk-to-the-city-reports/scatter/pipeline/steps/translation.py", line 119, in translate_batch
return translate_batch(batch, lang_prompt, model, retries - 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
RecursionError: maximum recursion depth exceeded
python main.py configs/aipubcom.json 56.16s user 33.46s system 0% cpu 4:13:37.44 total
What the hell (full stack trace omitted). It ran for four hours and then failed on an error that is merely probabilistic. The cost isn't that high, but something is wrong with translation taking four hours when only the number of clusters changed this time. And because these probabilistic failures aren't handled carefully, the larger the data, the higher the chance of having to start over.
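An iterative retry loop with backoff, instead of the recursive one in translation.py, would at least avoid the RecursionError. A sketch (call_llm is a hypothetical stand-in for the actual API call):
import json, time

def translate_batch_safe(batch, lang_prompt, model, retries=5):
    for attempt in range(retries):
        response = call_llm(batch, lang_prompt, model)  # hypothetical helper
        try:
            parsed = [a.strip() for a in json.loads(response)]
            if len(parsed) == len(batch):  # guard against batch size mismatch
                return parsed
        except json.decoder.JSONDecodeError:
            pass
        time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError("giving up after retries; caller should save partial progress")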
The Japanese translation had already finished, but since the Taiwanese pass crashed, no translated data was written out, so the Japanese translation is missing too. The division into 50 clusters is all sorts of unique and interesting. It might be interesting to derive a pro/con question from each of these clusters and run it through Polis.
Future Direction
- For Japanese users with Japanese input, the report should be generatable in Japanese end to end, instead of going through English internally as it does now.
- Improve the translation script a bit more, to the point where you can first check the clusters in English without any translation, then add languages later (Japanese, …).
- It would be difficult to process data from 20,000 people as things stand: roughly 5,000 data points are created from 2,000 individuals' public comments and all of them are plotted, so an earlier step that clusters similar opinions into one would be needed (see the sketch below).
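Merging near-duplicate opinions before plotting could look like this sketch (my own, not an existing pipeline step; the distance threshold would need tuning):
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def merge_similar(texts, embeddings, distance_threshold=0.1):
    # Fine-grained clustering: everything closer than the threshold
    # (in cosine distance) collapses into one representative point.
    labels = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    ).fit(np.asarray(embeddings)).labels_
    representatives = {}
    for i, label in enumerate(labels):
        representatives.setdefault(label, i)  # keep the first member of each group
    keep = sorted(representatives.values())
    return [texts[i] for i in keep], np.asarray(embeddings)[keep]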
2024-06-03
- Visualizing the contents of the Plurality book with Talk to the City. Well, how do I filter things out when this kind of garbage gets generated…
- There doesn't seem to be an official, easy way to do this.
Added a filter to `embedding.py` and forced a re-run.
util.py
for (i, option) in enumerate(sysargv):
    if option == '-f':
        config['force'] = True
    if option == '-o':
        config['only'] = sysargv[i + 1]
    if option == '-skip-interaction':
        config['skip-interaction'] = True
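So, judging from util.py, forcing just one step to re-run should look something like this (config name assumed):
% python main.py configs/plurality.json -f -o embedding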
Oh no, it gets checked here:
ValueError: Make sure that the embeddings are a numpy array with shape: (len(docs), vector_dim) where vector_dim is the dimensionality of the vector embeddings.
I'll have to rewrite args.csv upstream instead. filter_argument.py
import pandas as pd
# Load the arguments from the CSV file
arguments = pd.read_csv("outputs/plurality/args.csv")
# List of arguments to ignore
ignores = ["error", "argument", "reference", "references", "source", "sources", "message", "contributions"]
# show size
print(len(arguments))
# Filter out the ignored arguments
filtered_arguments = arguments[~arguments['argument'].isin(ignores)]
# show size
print(len(filtered_arguments))
# write filtered_arguments to output/plurality/args.csv
filtered_arguments.to_csv("outputs/plurality/args.csv", index=False)
arguments_array.shape=(3497,), embeddings_array.shape=(3353, 1536)
I updated the args, but for some reason it's not re-running. Forcing it with -o again.
arguments_array.shape=(3497,), embeddings_array.shape=(3497, 1536)
It's done.
Oh, better. Released. https://nishio.github.io/tttc_plurality/
2024-06-17
Encountered 15 files that should have been pointers, but weren't:
This page is auto-translated from [/nishio/Talk to the City試す](https://scrapbox.io/nishio/Talk to the City試す) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.