- Text extracted from a PDF is split line by line.
- Lines break even in the middle of a sentence, so they must be joined before natural language processing.
- But joining every line is not always correct.
- Headlines
- List items
- Figure captions
- Code
- Footnotes
- Formulas
- URLs
- Formulas, code, URLs, etc. are heterogeneous and should be removed.
- There are things that must not be joined directly and things that may be joined.
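
The distinction above could be sketched roughly as follows. This is only a minimal illustration, not the method behind this note: every regular expression, the 0.5 letter-ratio threshold, and the names `classify` and `join_lines` are assumptions made up for the example, and real PDFs would need tuning. Ordinary text lines are joined into paragraphs; headlines, list items, captions, and footnotes are kept on their own lines; code, formulas, and URL-only lines are dropped.

```python
import re

# All patterns below are illustrative guesses, not rules taken from the note.
URL_ONLY = re.compile(r"^\s*https?://\S+\s*$")
CODE = re.compile(r"^(?: {4}|\t)|[{}]\s*$")            # indented, or ends with a brace
LIST_ITEM = re.compile(r"^\s*(?:[-*•・]|\d+[.)])\s+\S")
HEADLINE = re.compile(r"^\d+(?:\.\d+)*\s+[A-Z]")        # e.g. "2.1 Method"
CAPTION = re.compile(r"^(?:Figure|Fig\.|Table)\s*\d+", re.IGNORECASE)
FOOTNOTE = re.compile(r"^(?:\[\d+\]|[†‡])\s*\S")


def looks_like_formula(line: str) -> bool:
    """Guess: an '=' sign plus a low share of letters suggests a formula."""
    stripped = line.strip()
    if "=" not in stripped:
        return False
    letters = sum(ch.isalpha() for ch in stripped)
    return letters / len(stripped) < 0.5


def classify(line: str) -> str:
    """Return 'blank', 'drop', 'standalone', or 'text' for one line."""
    if not line.strip():
        return "blank"
    if URL_ONLY.match(line) or CODE.search(line) or looks_like_formula(line):
        return "drop"
    if any(p.match(line) for p in (LIST_ITEM, HEADLINE, CAPTION, FOOTNOTE)):
        return "standalone"
    return "text"


def join_lines(lines: list[str]) -> list[str]:
    """Join consecutive 'text' lines; any other kind of line ends the paragraph."""
    out: list[str] = []
    paragraph: list[str] = []

    def flush() -> None:
        if paragraph:
            out.append(" ".join(paragraph))
            paragraph.clear()

    for line in lines:
        kind = classify(line)
        if kind == "text":
            paragraph.append(line.strip())
            continue
        flush()                       # blank, dropped, and standalone lines break the paragraph
        if kind == "standalone":
            out.append(line.strip())  # kept, but never joined to its neighbours
    flush()
    return out


sample = [
    "1 Introduction",
    "Text extracted from a PDF is broken",
    "in the middle of a sentence.",
    "- first item",
    "E = mc^2",
    "https://example.com/paper.pdf",
    "Figure 1: An example.",
    "The next paragraph starts",
    "here.",
]
for joined in join_lines(sample):
    print(joined)
```

In this sketch a dropped line also ends the current paragraph, so the text before and after a removed formula or URL is not joined across the gap. Hyphenated words split across lines and list items that wrap onto a second line are not handled.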
This page is auto-translated from /nishio/行継続判定 using DeepL. If something looks interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.