Document Frequency
- Affected by [Document Granularity]
- As an extreme example, if every single word is treated as its own document, DF can be made to coincide with TF
- A document is usually counted as “1” if the word appears in it at least once.
- To capture how concentrated a word is, the variant “1 if it appears two or more times” is also used
- In other words, a step function is being applied to the occurrence count.
- The threshold is on the raw number of occurrences, a value that naturally tends to increase as the number of words in the document increases
- Wouldn’t it be better to divide by the number of words to get the probability of occurrence…
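The thresholds discussed above can be sketched as follows. This is only a minimal illustration, assuming documents are already tokenized into lists of words. The names `document_frequency` and `relative_document_frequency` and the parameters `min_count` and `min_rate` are made up for this example and do not come from any library.

```python
def document_frequency(docs, term, min_count=1):
    """Number of documents in which `term` occurs at least `min_count` times.

    This is the step function: each document contributes 0 or 1.
    """
    return sum(1 for doc in docs if doc.count(term) >= min_count)


def relative_document_frequency(docs, term, min_rate):
    """Assumed variant: threshold on occurrences divided by document length."""
    return sum(
        1 for doc in docs
        if doc and doc.count(term) / len(doc) >= min_rate
    )


docs = [
    ["a", "b", "a"],
    ["a", "c"],
    ["b", "c", "c", "c"],
]

# Usual DF: "appears at least once".
df_a = document_frequency(docs, "a")                  # 2
# Concentration variant: "appears two or more times".
df2_a = document_frequency(docs, "a", min_count=2)    # 1

# Extreme granularity: one word per document makes DF equal to the total TF.
one_word_docs = [[w] for doc in docs for w in doc]
tf_a = sum(doc.count("a") for doc in docs)            # 3
df_a_fine = document_frequency(one_word_docs, "a")    # 3

# Length-normalized threshold: "c" has rates 0, 0.5, 0.75 in the three documents.
rdf_c = relative_document_frequency(docs, "c", min_rate=0.5)  # 2
```

With a raw-count threshold, a long document passes more easily simply because it contains more words. The rate-based version is one way to remove that dependence, at the cost of having to choose `min_rate`.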
This page is auto-translated from /nishio/DF using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.