Document Frequency

  • Affected by [Document Granularity]
  • As an extreme example, if every single word is treated as its own document, DF matches TF
  • Usually a document is counted as “1” if the word appears in it at least once. For concentration (of one’s attention), the variant “1 if it appears two or more times” is also used
    • In other words, a step function is being applied to the occurrence count.
    • The threshold is set on the raw number of occurrences, a value that naturally tends to increase as the number of words in the document increases
      • Wouldn’t it be better to divide by the number of words to get the probability of occurrence…
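The notes above can be made concrete with a small sketch. It assumes documents are already tokenized into lists of words; the function names `document_frequency` and `document_frequency_by_rate`, the sample documents, and the `0.2` rate threshold are all made up for illustration and do not come from the original note.

```python
def document_frequency(docs, word, min_count=1):
    """Count documents in which `word` occurs at least `min_count` times.

    This is the step function from the notes: each document contributes
    1 if its raw count reaches the threshold, otherwise 0.
    """
    return sum(1 for doc in docs if doc.count(word) >= min_count)


def document_frequency_by_rate(docs, word, min_rate):
    """Hypothetical variant: threshold on count / document length instead."""
    return sum(
        1 for doc in docs
        if doc and doc.count(word) / len(doc) >= min_rate
    )


# Illustrative data only: three tokenized documents of different lengths.
docs = [
    "the cat sat on the mat".split(),
    "the dog barked".split(),
    "a cat and a dog and a bird and a fish met the other cat".split(),
]

df_once = document_frequency(docs, "the")                # "at least once"
df_twice = document_frequency(docs, "the", min_count=2)  # "two or more times"

# Extreme granularity: every word becomes its own one-word document,
# so DF is just the total number of occurrences (TF over the corpus).
one_word_docs = [[w] for doc in docs for w in doc]
df_extreme = document_frequency(one_word_docs, "the")
tf_total = sum(doc.count("the") for doc in docs)

# Dividing by document length removes the bias toward long documents.
df_rate = document_frequency_by_rate(docs, "the", min_rate=0.2)
```

With the raw-count threshold, the long third document reaches "two or more times" for "cat" purely because of its length, while the rate-based variant only counts the two short documents where "the" makes up a large share of the words.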

This page is auto-translated from /nishio/DF using DeepL. If something looks interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I’m very happy to share my thoughts with non-Japanese readers.