If each character is simply mapped to its own distributed representation (embedding), any character that was never seen before is treated as an unknown word (an unknown character). If each character is instead converted to features, and the distributed representation is built from those features, even unknown characters should be handled acceptably.
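The unknown-character problem can be illustrated with a minimal sketch. The names `build_vocab`, `char_to_id`, and the `<unk>` token are hypothetical, chosen only for this illustration; the point is that every unseen character collapses to the same ID, so a per-character embedding table cannot tell them apart.

```python
# Toy per-character vocabulary (illustrative names, not from the original note).
UNK = "<unk>"


def build_vocab(text):
    """Assign an integer ID to each distinct character; ID 0 is reserved for unknowns."""
    vocab = {UNK: 0}
    for ch in text:
        vocab.setdefault(ch, len(vocab))
    return vocab


def char_to_id(ch, vocab):
    """Look up a character, falling back to the unknown ID when it was never seen."""
    return vocab.get(ch, vocab[UNK])


vocab = build_vocab("abcあい")
seen_id = char_to_id("a", vocab)
# "z" and "う" never appeared in the training text, so both map to the same ID.
unseen_ids = [char_to_id(ch, vocab) for ch in "zう"]
```

Here a Latin letter and a hiragana character end up indistinguishable, which is the information loss that a feature-based representation is meant to avoid.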

What features can be extracted from a character?

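  • whether it is an alphabetic letter
  • whether it is in A-Z
  • whether it is uppercase
  • whether it is a symbol
  • whether it is a full-width digit
  • whether it is hiragana (the cursive Japanese syllabary used mainly for native Japanese words, especially function words and inflections)
  • whether it is katakana (the angular Japanese syllabary used mainly for loanwords)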
  • encoding a kanji numerically by its structure (e.g. describing a kanji as top + bottom + side components)
    • the radical of the kanji

Frequency Analysis of Character Appearance in Japanese Sentences - WentWayUp


This page is auto-translated from /nishio/文字の特徴量 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.