Character Features

If you simply map each character to its own distributed representation, any character you have never seen before falls back to the unknown-character token. If you instead convert each character to features and build the distributed representation from those features, even unseen characters can be handled.
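A minimal sketch of this contrast, under stated assumptions: the two features used here (Unicode general category and the first word of the Unicode character name), the training string, and the helper names `embed_lookup` and `embed_features` are all hypothetical and only illustrate the idea. Random vectors stand in for trained ones.

```python
import random
import unicodedata

random.seed(0)
DIM = 4  # toy embedding size (assumption)


def rand_vec():
    """Random vector standing in for a trained embedding."""
    return [random.uniform(-1, 1) for _ in range(DIM)]


# (a) Plain lookup: one vector per character seen in "training".
train_chars = "あいうABC"
lookup = {ch: rand_vec() for ch in train_chars}
UNK = rand_vec()


def embed_lookup(ch):
    # Every unseen character collapses to the same UNK vector.
    return lookup.get(ch, UNK)


# (b) Feature-based: one vector per feature value, summed per character.
def feature_keys(ch):
    # Hypothetical feature set: general category + first word of the name.
    name = unicodedata.name(ch, "UNKNOWN")
    return ["cat=" + unicodedata.category(ch), "script=" + name.split()[0]]


feature_vecs = {}
for ch in train_chars:
    for key in feature_keys(ch):
        feature_vecs.setdefault(key, rand_vec())


def embed_features(ch):
    # Use only feature values seen in training; fall back to UNK if none match.
    vecs = [feature_vecs[k] for k in feature_keys(ch) if k in feature_vecs]
    if not vecs:
        return UNK
    return [sum(xs) for xs in zip(*vecs)]
```

Here "え" and "D" are both absent from the training string. The lookup table gives them the same vector, while the feature-based version still separates an unseen hiragana from an unseen Latin capital. With features this coarse, "え" gets exactly the same vector as "あ"; a richer feature set would separate them further.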

What features can be extracted?

alphabetic letter
A-Z
uppercase
symbol
full-width digit
hiragana
katakana
numeric encoding of a kanji (e.g. describing a kanji as top + bottom + side parts)
- radical (of a kanji character)
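Most of the features listed above can be sketched with Python's standard `unicodedata` module. This is one possible reading of the list, not a definitive implementation: the function names `char_features` and `to_vector` and the exact tests are assumptions. The radical is left out because the standard library does not expose it; it would need external data such as the Unihan database's `kRSUnicode` field.

```python
import string
import unicodedata


def char_features(ch):
    """Return boolean features for a single character (illustrative only)."""
    name = unicodedata.name(ch, "")
    category = unicodedata.category(ch)
    return {
        "is_latin_letter": ch in string.ascii_letters,
        "is_upper": "A" <= ch <= "Z",
        # Unicode punctuation (P*) and symbol (S*) categories.
        "is_symbol": category[0] in ("P", "S"),
        # Full-width digits occupy U+FF10 to U+FF19.
        "is_fullwidth_digit": "０" <= ch <= "９",
        "is_hiragana": name.startswith("HIRAGANA"),
        "is_katakana": name.startswith("KATAKANA"),
        "is_kanji": name.startswith("CJK UNIFIED IDEOGRAPH"),
    }


def to_vector(ch):
    """Flatten the features to a 0/1 list, in dict insertion order."""
    return [int(v) for v in char_features(ch).values()]
```

For example, `to_vector("A")` sets only the first two entries (Latin letter, uppercase). A vector like this can be fed to a small projection layer in place of a per-character lookup.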

Frequency Analysis of Character Appearance in Japanese Sentences - WentWayUp

Is it enough if the frequent characters are represented well?
Should the goal be a higher accuracy rate with word2vec?

This page is auto-translated from /nishio/文字の特徴量 using DeepL. If something looks interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.
