The naive vector similarity method separates English and Japanese.

  • ベクトル空間で英語と日本語は分離してる.icon

For me personally, this is a big problem, so I’m not very satisfied with a simple embedded vector similarity

  • But very few people use both English and Japanese, so you don’t have to worry about it.

But when I went to bed and woke up, I felt like, “Why don’t we machine translate every English sentence into Japanese and load it into a vector index?

  • Better than making it into a vector and then thinking “how do I paste together the spaces that are far apart”?

  • In Plurality Japanese translation, the language is now included in the vector index without separating the languages.

  • [/plurality-japanese/Plurality Vector Search](https://scrapbox.io/plurality-japanese/Plurality Vector Search)

  • I feel like I can’t provide user value with that.

    • If it is Japanese who use it, the query is also usually in Japanese
    • But when I search for idioms, I get Chinese hits.
  • If so, it would be better to load the “Japanese” side of “other language Japanese” into the search index.

    • Then, turn the arrow in the opposite direction and read the original text after the Japanese hits if you want to.

The association of two chunks by machine translation is a kind of “link”

  • Hits by vector search are loose links
  • Compare with explicit human links in Scrapbox
    • Link meaning “same content, different language.”
    • Scrapbox link to “Human Associations” link

What to record


This page is auto-translated from /nishio/日記2024-01-14 using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thought to non-Japanese readers.