Stop list generation for [RAKE
-
Words adjacent to but not included in the keyword are good candidates for [stopword
-
Both precision and recall were improved by excluding from the stop list words whose frequency in the keyword is higher than the frequency adjacent to the keyword.
-
F value is best with the largest stop list and could be made even better by making it larger.
- The stop list made by DF alone is worse than the opposite.
- Note that the training was performed on 1000 of the 2000 abstracts, and DF was tried on 10 or more, 25 or more, and 50 or more, respectively.
- Interesting to use DF instead of TF. impressions
-
I see what you mean
- I don’t like the idea of splitting it into two values in a snap based on “more or less,” but that’s not the problem with this algorithm, it’s the problem with the concept of stopwords in the first place.
- Extending to the domain of probability, we can say that when a word is observed, the probability that it is part of a keyword and the probability that it is adjacent to a keyword
- If a keyword is at the edge of a candidate, it is reasonable to think about whether to expand the candidate or not, and to say, “Let’s not expand it because the probability of it being adjacent is higher.
- Not when it’s among the keyword candidates, right?
- For example, “of” and “of” have a low probability of being present at the middle end of keyword candidates, but they are rather common in some of them.
- So if it’s overdivided, if it comes up more than once, it’s rejoined, but I don’t know if it’s a good balance…
- Each word has a “probability of being among the keyword candidates”.
- The stopword is it’s zero, and the state it approximates.
- Scores to be multiplied
- Score for grammatical appropriateness of the keyword form
- The score of a token sequence is calculated as the product of the scores of each token
- State that this is 0/1 depending on whether this is a stop list element or not.
- If it is a probability value, the more you multiply, the smaller it gets.
- →Bias for selection of short key phrases
- Offset by [Bias for longer key phrases to be selected
- The stopword is it’s zero, and the state it approximates.
- I don’t like the idea of splitting it into two values in a snap based on “more or less,” but that’s not the problem with this algorithm, it’s the problem with the concept of stopwords in the first place.
-
If you have a “document with some keywords already marked up” like Scrapbox, you can get a stop list from there.
This page is auto-translated from /nishio/RAKEのストップリスト生成 using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thought to non-Japanese readers.