- This is especially true of OCR'd scans of old books, where recognition sometimes fails and the output is garbage strings.
- It is bad when such garbage gets mixed into the source data for language modeling.
- I think the garbage is easy to catch and remove, because the characters that appear in it are so blatantly skewed.
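
One way to act on the character-bias idea above is sketched below. This is only an illustration, not necessarily the method the note has in mind: it assumes you have some known-clean text to estimate character frequencies from, and it drops lines whose average per-character log-probability under that model is low. The function names (`build_char_model`, `avg_log_prob`, `drop_garbage`) and the threshold are hypothetical and would need tuning on real data.

```python
import math
from collections import Counter


def build_char_model(clean_text):
    """Estimate per-character log-probabilities from known-clean text.

    Uses add-one smoothing, with one extra slot reserved for unseen characters.
    """
    counts = Counter(clean_text)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 for the "unseen character" slot
    logp = {ch: math.log((n + 1) / (total + vocab)) for ch, n in counts.items()}
    unseen = math.log(1 / (total + vocab))
    return logp, unseen


def avg_log_prob(line, model):
    """Average log-probability per character; empty lines score 0.0."""
    logp, unseen = model
    if not line:
        return 0.0
    return sum(logp.get(ch, unseen) for ch in line) / len(line)


def drop_garbage(lines, model, threshold):
    """Keep only lines whose score is at or above the (assumed) threshold."""
    return [ln for ln in lines if avg_log_prob(ln, model) >= threshold]
```

Lines made up mostly of characters that are rare or absent in the clean text get a much lower score than ordinary sentences, so a single threshold may be enough to separate them. Note that in this sketch empty lines score 0.0 and are therefore always kept.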
This page is auto-translated from /nishio/OCRゴミ掃除 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.