If you simply assign a distributed representation (embedding) to each character, any character that was never seen before ends up treated as an unknown word (an unknown character). If each character is instead converted to features and the distributed representation is built from those features, even unknown characters might be handled fine.
What features can be extracted?
- is an alphabetic letter
- is in A-Z
- is uppercase
- is a symbol
- is a full-width digit
- is hiragana
- is katakana
- a numerical encoding of a kanji's structure (e.g. describing a kanji as top + bottom + side parts)
- radical (of a kanji character)
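As a rough illustration of the list above, here is a minimal Python sketch that turns a character into a binary feature vector and then sums per-feature vectors into a distributed representation. The names `char_features`, `embed`, `FEATURE_NAMES` and `FEATURE_VECTORS` are hypothetical, the code point ranges are simplifications, and the per-feature vectors are random placeholders standing in for vectors that would normally be learned. The kanji-structure and radical features are left out because the standard library does not provide that data.

```python
import random
import unicodedata
from typing import List

# Hypothetical feature names, following the list above.
FEATURE_NAMES = [
    "is_latin_letter", "is_ascii_upper", "is_upper", "is_symbol",
    "is_fullwidth_digit", "is_hiragana", "is_katakana", "is_kanji",
]

def char_features(ch: str) -> List[int]:
    """Return a 0/1 feature vector for a single character."""
    code = ord(ch)
    name = unicodedata.name(ch, "")
    category = unicodedata.category(ch)
    return [
        int("LATIN" in name and category.startswith("L")),  # alphabetic (Latin) letter
        int("A" <= ch <= "Z"),                               # ASCII A-Z
        int(ch.isupper()),                                   # uppercase
        int(category[0] in "PS"),                            # punctuation or symbol
        int(0xFF10 <= code <= 0xFF19),                       # full-width digit
        int(0x3041 <= code <= 0x309F),                       # hiragana (simplified range)
        int(0x30A0 <= code <= 0x30FF),                       # katakana (simplified range)
        int(0x4E00 <= code <= 0x9FFF),                       # kanji (basic CJK block only)
    ]

# Assumption: random placeholder vectors; in practice these would be learned.
DIM = 4
_rng = random.Random(0)
FEATURE_VECTORS = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in FEATURE_NAMES]

def embed(ch: str) -> List[float]:
    """Sum the vectors of the active features to get a character vector."""
    vec = [0.0] * DIM
    for active, feature_vec in zip(char_features(ch), FEATURE_VECTORS):
        if active:
            for i in range(DIM):
                vec[i] += feature_vec[i]
    return vec

for ch in "Aaあア漢５!":
    print(ch, char_features(ch))
```

With only these features, every kanji maps to the same vector, so a kanji that never appeared in the training data still gets a usable representation. Telling kanji apart would need finer features such as the radical.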
Frequency Analysis of Character Appearance in Japanese Sentences - WentWayUp
- Is it enough if the frequent characters are represented well?
- Should the accuracy (percentage of correct answers) with word2vec be higher?
This page is auto-translated from /nishio/文字の特徴量 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.