Talk about how "yet another small model" is futile, but "a tokenizer plus an extra layer suitable for Japanese" is necessary.

From Unexplored Conference 2023, "Is a Japanese language model necessary?" https://twitter.com/nishio/status/1634236182136762368

  • human.icon Is a Japanese language model necessary?
  • nishio.icon Not in the sense of building yet another small model. However, if communicating with a giant language model in Japanese is less efficient than in English, that is a loss for Japanese speakers, so it does not follow that nothing needs to be done.
  • human.icon So are you saying we should increase the amount of Japanese training data?
  • nishio.icon No. A tokenizer's token splits follow the frequency of word occurrences in its training corpus, so speakers of different languages have conflicting interests; one model trained on more data is not enough (see the token-count sketch after this list).
  • human.icon If machine translation evolves, it won't matter.
  • nishio.icon If the model's output is produced as an English word sequence and then converted into a Japanese word sequence, then in every domain where English resolves the world at a lower resolution than Japanese, the result is degraded to match English, so the Japanese output can never be better than what English can express.
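
A minimal sketch of the conflicting-interests point, using OpenAI's tiktoken library as one concrete choice (any BPE tokenizer whose vocabulary was built mostly from English text shows the same effect): the same question costs noticeably more tokens in Japanese than in English.

```python
# Sketch: a BPE vocabulary built mostly from English splits Japanese
# into many more tokens per sentence.  pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer of GPT-4-era models

samples = {
    "English": "Is a Japanese language model necessary?",
    "Japanese": "日本語の言語モデルは必要か？",  # the same question in Japanese
}
for label, text in samples.items():
    ids = enc.encode(text)
    # More tokens per sentence means higher cost and a shorter effective
    # context window for speakers of that language.
    print(f"{label}: {len(ids)} tokens for {len(text)} characters")
```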

  • @kuboon I think I heard that recent machine translation converts once into an intermediate language that is not any particular language, like an LLVM for natural languages.
  • @nishio What humans imagine when they hear "intermediate language" is a sequence of "discrete symbols" called words, but in an LLM what corresponds to a single word is in the first place a vector of 1000 floats or so, and its expressive power is orders of magnitude higher. So what is needed is to "convert it directly into Japanese", not to "convert it into English and then into Japanese".
  • @kuboon I was envisioning machine translation that goes directly from the 1000-dimensional float word-vector sequence into natural language, without the LLM outputting English first.
  • @nishio If that is needed and its accuracy is inferior for some natural language, only the speakers of that language suffer the loss. Since the speakers of each language have conflicting interests, this is not a problem that can be solved by leaving it to the speakers of other languages.
  • @kuboon Would it help to standardize a tokenizer API so that native speakers of each language could commit to it?
  • @nishio The tokenizer alone is too thin a layer; it only maps byte sequences to discrete token IDs. I think we need an opening that can do I/O around the point where the token ID becomes a vector, and a little further inward, where the superficial differences between languages disappear and it becomes a vector of meaning (a sketch of that token-ID-to-vector step follows below).
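
A hedged illustration of the layer being discussed, using the HuggingFace transformers library (an assumption; the "gpt2" checkpoint and its 768-dimensional embeddings stand in for the "1000 dimensions or so" in the thread): the tokenizer produces discrete token IDs, and the model's input-embedding table is exactly the point where each ID becomes a float vector.

```python
# Sketch: the token-ID -> vector step that a tokenizer API alone
# does not expose.  pip install transformers torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

ids = tokenizer("language model", return_tensors="pt").input_ids
vectors = model.get_input_embeddings()(ids)  # discrete IDs become floats here

print(ids.shape)      # e.g. torch.Size([1, 2])      -- two token IDs
print(vectors.shape)  # e.g. torch.Size([1, 2, 768]) -- two 768-dim vectors
```

Exposing I/O at this boundary (and at deeper, more language-neutral layers) is the kind of "opening" the thread asks for, beyond a standardized tokenizer.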

Related: Do differences in Japanese tokenizers affect downstream task performance? - Morphological analysis improves performance.


This page is auto-translated from /nishio/日本語の言語モデルは必要か？ using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.