[image]

nishiogleninjapan In the lightning talk, I didn’t delve into the mathematical details due to limited time. For a more detailed explanation, I created diagrams by embedding the meanings of words into a vector space using an LLM. [image]

nishio This is a two-dimensional visualization of the meanings of each word, embedded in a high-dimensional space using OpenAI’s text embedding API. In simple terms, it shows how the AI recognizes which words have similar meanings.
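The post doesn’t include the code for this step, so here is a minimal sketch of fetching the vectors, assuming the current openai-python client; the model name is an assumption, since the post doesn’t say which embedding model was used.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

words = ["納得", "understanding", "agreement", "satisfaction"]
response = client.embeddings.create(
    model="text-embedding-ada-002",  # assumed model, not stated in the post
    input=words,
)
# One high-dimensional vector per word (1536 floats for ada-002)
vectors = {w: d.embedding for w, d in zip(words, response.data)}
```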

nishio Plotting two languages on a single chart is not straightforward. In this chart, the first principal component of PCA turned out to represent the difference between the languages, so that axis was removed.
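A sketch of how that PC1-removal might look, reusing the `vectors` dict from the snippet above; whether the author plotted PC2/PC3 directly like this or refit PCA after removing PC1 is my assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = np.array(list(vectors.values()))   # shape (n_words, n_dims)
labels = list(vectors.keys())

pca = PCA(n_components=3)
Y = pca.fit_transform(X)               # columns: PC1, PC2, PC3

# Drop PC1 (the language axis) and chart PC2 against PC3
plt.scatter(Y[:, 1], Y[:, 2])
for (x, y), w in zip(Y[:, 1:3], labels):
    plt.annotate(w, (x, y))
plt.xlabel("PC2")
plt.ylabel("PC3")
plt.show()
```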

nishio Here is the annotated version. The plotted words are a mix of those I chose myself and those GPT-4 suggested as similar. It shows that GPT-4 could not find an English word close to the Japanese 納得 (nattoku); “understanding” and “agreement” are the main glosses given in dictionaries. [image]
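The exact prompt used to collect GPT-4’s candidate words isn’t given in the post; a hypothetical version of that query could look like this.

```python
from openai import OpenAI

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        # Hypothetical wording; the author's actual prompt is not stated
        "content": "List English words whose meaning is similar to "
                   "the Japanese word 納得 (nattoku), one per line.",
    }],
)
print(reply.choices[0].message.content)
```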

nishio One word can bridge multiple concepts. In this example, the Japanese word “納得” ( nattoku) serves as a bridge connecting concepts like “understanding”, “agreement”, and “satisfaction”. Similarly, in Mandarin, “數位” (shùwèi) connects concepts like “digital” and “plural”.
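To check this “bridging” quantitatively, one could compare cosine similarities of 納得 against each neighboring concept, reusing the `vectors` dict from the first sketch; if 納得 scores comparably against several English words, it bridges them.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    u, v = np.asarray(u), np.asarray(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for w in ["understanding", "agreement", "satisfaction"]:
    print(w, cosine(vectors["納得"], vectors[w]))
```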

nishio In a mapping from a high-dimensional space (H) to a low-dimensional space (L), objects that are close in H generally remain close in L. However, there is no guarantee that objects far apart in H will also be far apart in L.

nishio You can think of it like the shadow of a three-dimensional object: two points far apart in 3D can still cast overlapping shadows. By the contrapositive, though, points that are far apart in L must also be far apart in H. Therefore, the absence of proximity in the low-dimensional plot is reliable information about the high-dimensional space.
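A tiny numeric illustration of the shadow analogy (not from the original post): an orthogonal projection can only shrink distances, never grow them, so far-apart in the projection implies far-apart in the original space.

```python
# Two points far apart in 3D whose "shadows" coincide in 2D
import numpy as np

p = np.array([0.0, 0.0, 0.0])
q = np.array([0.0, 0.0, 5.0])            # far from p in 3D

shadow = lambda v: v[:2]                 # project onto the xy-plane

print(np.linalg.norm(p - q))                   # 5.0 apart in H (3D)
print(np.linalg.norm(shadow(p) - shadow(q)))   # 0.0 apart in L (2D)
```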

Making of the chart

[image]

  • Simple PCA generated this
    • it shows the difference between the languages

[image] [image]

  • Visualization of each language separately
  • These are good for observing each language on its own, but by the nature of PCA they cannot simply be overlaid on one another.
  • While inspecting them, I found that some candidate words lie far from all the others. Those outliers were omitted from the visualization, since we have only two or three dimensions to express the information (a sketch of such filtering follows this list).
    • [image]
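The post doesn’t state how the outliers were detected; a plausible sketch, with a threshold that is entirely my assumption, is distance from the centroid of the `vectors` dict built earlier.

```python
import numpy as np

X = np.array(list(vectors.values()))
labels = np.array(list(vectors.keys()))

dist = np.linalg.norm(X - X.mean(axis=0), axis=1)  # distance from the centroid
keep = dist < dist.mean() + 2 * dist.std()         # assumed cutoff: 2 std devs

X, labels = X[keep], labels[keep]                  # visualize only these words
```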

As described above, the first principal component of the PCA was interpreted as the axis representing the difference between the languages and was removed.

  • Finally got this
    • [image]

This page is auto-translated from [/nishio/Nattoku in Vector space](https://scrapbox.io/nishio/Nattoku in Vector space) using DeepL. If you see something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.