Stop list generation for [RAKE]

  • Words that appear adjacent to a keyword, but not inside it, are good [stopword] candidates

  • Excluding from the stop list those words that occur more frequently inside keywords than adjacent to them improved both precision and recall.

  • The F-score was best with the largest stop list, and might improve further if the list were made even larger.

    • A stop list built from document frequency (DF) alone performs worse than one that also uses keyword adjacency.
    • Note that training used 1,000 of the 2,000 abstracts, and DF thresholds of at least 10, 25, and 50 were tried. (A sketch of the procedure follows this segment.)
      • Interesting that they use DF rather than TF. (impressions)
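
A minimal sketch of the stop-list generation summarized above, assuming tokenized training documents paired with their annotated keywords. The function name, input format, and default threshold are illustrative, not taken from the paper:

```python
from collections import Counter

def build_stoplist(docs, keywords_per_doc, min_df=10):
    """Adjacency-based stop list generation (sketch).

    docs: list of token lists, one per training document
    keywords_per_doc: list of sets of keyword token tuples, one per document
    min_df: document-frequency threshold (the note mentions >=10, >=25, >=50)
    """
    df = Counter()        # number of documents containing each word
    kw_freq = Counter()   # occurrences inside an annotated keyword
    adj_freq = Counter()  # occurrences immediately adjacent to a keyword

    for tokens, keywords in zip(docs, keywords_per_doc):
        df.update(set(tokens))
        covered = [False] * len(tokens)  # positions inside some keyword
        for i in range(len(tokens)):
            for kw in keywords:
                if tuple(tokens[i:i + len(kw)]) == kw:
                    for j in range(i, i + len(kw)):
                        covered[j] = True
        for i, tok in enumerate(tokens):
            if covered[i]:
                kw_freq[tok] += 1
            elif (i > 0 and covered[i - 1]) or (i + 1 < len(tokens) and covered[i + 1]):
                adj_freq[tok] += 1

    # keep frequent keyword-adjacent words, excluding any word that occurs
    # inside keywords more often than next to them
    return {w for w in adj_freq if df[w] >= min_df and kw_freq[w] <= adj_freq[w]}
```
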
  • I see what you mean

    • I don’t like reducing this to a binary split based on a simple “more or less” comparison, but that is not a flaw of this algorithm; it is a flaw of the concept of stopwords in the first place.
      • Extending this into probabilistic terms: when a word is observed, we can consider both the probability that it is part of a keyword and the probability that it is adjacent to one.
      • When a word sits at the edge of a candidate, it is reasonable to decide whether or not to expand the candidate by saying, “Don’t expand, because this word is more likely to be adjacent than inside.”
      • But that reasoning doesn’t apply when the word is in the middle of a keyword candidate, right?
        • For example, a word like “of” has a low probability of appearing at either end of a keyword candidate, yet appears fairly often in the middle of one.
        • So over-split candidates are rejoined if the joined form occurs more than once, but I don’t know whether that strikes a good balance…
    • Each word has a “probability of being inside a keyword candidate”.
      • A stopword approximates the case where this probability is zero.
        • These are the scores to be multiplied together.
      • A score for the grammatical appropriateness of the keyword’s form:
        • The score of a token sequence is the product of the scores of its tokens.
        • With a stop list, each token’s score is 0/1 depending on whether the token is on the list.
        • If it is a probability value instead, the product gets smaller the more factors you multiply in.
        • → a bias toward selecting short key phrases
          • Offset by a [bias toward selecting longer key phrases] (see the sketch after this list).
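
A toy sketch of the scoring discussion above. The per-token probabilities and the geometric-mean normalization are illustrative choices, not part of RAKE:

```python
# Toy probabilities that a token is inside a keyword candidate
# (illustrative values; a stop list is the special case where each is 0 or 1)
P_INSIDE = {"automatic": 0.9, "keyword": 0.95, "extraction": 0.9, "of": 0.3, "the": 0.05}

def raw_score(tokens):
    """Product of per-token probabilities: it shrinks as the phrase grows,
    which is the bias toward short key phrases noted above."""
    score = 1.0
    for t in tokens:
        score *= P_INSIDE.get(t, 0.5)  # assume 0.5 for unseen tokens
    return score

def length_normalized_score(tokens):
    """One possible offsetting bias toward longer phrases:
    the geometric mean, i.e. the n-th root of the product."""
    return raw_score(tokens) ** (1.0 / len(tokens))

print(raw_score(["keyword", "extraction"]))               # 0.855
print(raw_score(["automatic", "keyword", "extraction"]))  # 0.7695 — longer, so smaller
print(length_normalized_score(["automatic", "keyword", "extraction"]))  # ≈0.916
```
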
  • If you have documents with some keywords already marked up, as in Scrapbox, you can derive a stop list from them (a sketch follows below).
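
A rough sketch of harvesting such a stop list from Scrapbox-style text, where marked-up keywords appear as [bracketed] links. The parsing is deliberately naive, and the selection rule reuses the adjacency idea above:

```python
import re
from collections import Counter

def counts_from_marked_text(text):
    """Count each word's occurrences inside [bracketed] keywords versus
    immediately adjacent to them (naive Scrapbox-style parsing)."""
    inside, adjacent = Counter(), Counter()
    tokens = re.findall(r"\[[^\]]+\]|\w+", text)  # bracketed phrases or plain words
    for i, tok in enumerate(tokens):
        if tok.startswith("["):
            inside.update(tok.strip("[]").lower().split())
        else:
            before = tokens[i - 1] if i > 0 else ""
            after = tokens[i + 1] if i + 1 < len(tokens) else ""
            if before.startswith("[") or after.startswith("["):
                adjacent[tok.lower()] += 1
    return inside, adjacent

inside, adjacent = counts_from_marked_text(
    "A [stop list] for [RAKE] is built from words adjacent to the [keyword] spans."
)
stopwords = {w for w in adjacent if adjacent[w] > inside[w]}
print(sorted(stopwords))  # ['a', 'for', 'is', 'spans', 'the']
```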


This page is auto-translated from /nishio/RAKEのストップリスト生成 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.