RAKE :

input:
I was looking at the Wikipedia dump data, and the link notation nicely becomes a single token in SentencePiece, so I think I can use this. It will need a lot of preprocessing, though.
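For reference, a minimal sketch of the RAKE scoring used throughout this page, assuming the text is already tokenized (by SentencePiece or MeCab, as below) and that stopwords come from a document-frequency count; the function names and the 0.5 threshold are illustrative, not from the original code.

    from collections import Counter, defaultdict

    def build_stoplist(tokenized_docs, min_doc_ratio=0.5):
        # Assumption: a token counts as a stopword when it appears in many
        # of the sample documents (100 or 1000 Wikipedia pages below);
        # the threshold actually used in the experiment is not stated.
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(doc))
        return {t for t, n in df.items() if n >= min_doc_ratio * len(tokenized_docs)}

    def rake(tokens, stopwords):
        # Split the token stream into candidate phrases at stopwords
        # (punctuation is assumed to be on the stoplist as well).
        phrases, current = [], []
        for t in tokens:
            if t in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(t)
        if current:
            phrases.append(tuple(current))
        # Word score = degree / frequency, where a word's degree is the
        # summed token length of every candidate phrase it occurs in.
        freq, degree = defaultdict(int), defaultdict(int)
        for phrase in phrases:
            for w in phrase:
                freq[w] += 1
                degree[w] += len(phrase)
        word_score = {w: degree[w] / freq[w] for w in freq}
        # Phrase score = sum of member word scores; e.g. a 9-token phrase
        # whose tokens occur nowhere else scores 9 * 9 = 81, which matches
        # the top score below.
        scored = {p: sum(word_score[w] for w in p) for p in set(phrases)}
        return sorted(scored.items(), key=lambda kv: -kv[1])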

SentencePiece / 100 document stoplist :

output:
Link notation is sentencepiece: 81.00
wikipedia: 25.00
View dump data: 16.00
Becoming 1 token: 16.00
Various: 4.00
Preprocessing: 4.00
Also good: 4.00
Likely to be able to: 4.00
  • “Link notation is S ent ence P ie ce”.
  • “W i ki pe dia”
  • Even though it is split into 5 tokens, “Wikipedia” is correctly bundled back into one keyword.
  • The word rendered here as “law” is a single token, yet it slipped past the stoplist.
    • The stoplist was built from only about 100 Wikipedia pages, so a larger sample would likely catch more stopwords.
    • I want “link notation” itself to come out as a keyword in the first place, so SentencePiece’s segmentation granularity may be a poor fit.
      • The tokens may be too coarse because I reused the vocab_size=32000 model that was trained for the Japanese BERT model; a quick segmentation check is sketched below.
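A quick way to reproduce the segmentation check above; the model file name is a placeholder standing in for the Japanese-BERT SentencePiece model (vocab_size=32000) mentioned in the last bullet.

    import sentencepiece as spm

    # Placeholder path; substitute the actual vocab_size=32000 model.
    sp = spm.SentencePieceProcessor(model_file="ja_bert_32000.model")
    for text in ["リンク記法はSentencePiece", "Wikipedia"]:
        # out_type=str returns the subword pieces themselves, expected to
        # look like S/ent/ence/P/ie/ce and W/i/ki/pe/dia per the bullets above.
        print(sp.encode(text, out_type=str))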

MeCab / 1000 document stoplist :

Not good: 7.50
Not done: 4.50
Preprocessing: 4.00
Link notation: 4.00
So here it is: 4.00
Likely to be able to: 4.00
1 token: 4.00
In the name of: 3.50
but: 2.00
Hands: 1.50
Dump data: 1.00
Things: 1.00
Good: 1.00
wikipedia: 1.00
sentencepiece: 1.00
  • “I shouldn’t, but I won’t, so this, so I can.”
    • These fragments certainly don’t seem to appear much in Wikipedia, so the stoplist missed them.
  • The rest are disappointing too.
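For completeness, the MeCab tokenization assumed in this section can be reproduced with mecab-python3’s wakati (plain word-splitting) mode; the resulting surface forms feed straight into the RAKE sketch near the top of the page.

    import MeCab

    # -Owakati makes MeCab emit surface forms separated by single spaces.
    tagger = MeCab.Tagger("-Owakati")
    tokens = tagger.parse("リンク記法はSentencePieceでいい感じに1トークンになる").split()
    print(tokens)  # candidate input for rake(tokens, stopwords)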

MeCab / 1000 document stoplist / use average phrase-character-length instead of average phrase-token-length :

I shouldn't have: 15.00
sentencepiece: 13.00
Link notation: 10.00
1 token: 10.00
Not done: 9.00
wikipedia: 9.00
So here it is: 8.00
Possible: 8.00
Preprocessing: 6.00
Dump data: 6.00
NAME: 5.00
but: 4.00
Hand: 2.00
Things: 2.00
Good: 2.00
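One reading of the heading’s tweak: in standard RAKE, a word’s score degree/frequency is the average token length of the phrases containing it; here, phrase length is measured in characters instead, so phrases made of many short MeCab tokens stop dominating. A sketch of that interpretation, reusing the phrase list produced by the rake() sketch at the top; the exact change in the original experiment may differ.

    from collections import defaultdict

    def char_length_word_scores(phrases):
        # phrases: list of token tuples, as built in the rake() sketch above.
        freq, degree = defaultdict(int), defaultdict(int)
        for phrase in phrases:
            char_len = sum(len(w) for w in phrase)  # characters, not len(phrase)
            for w in phrase:
                freq[w] += 1
                degree[w] += char_len
        return {w: degree[w] / freq[w] for w in freq}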

This page is auto-translated from /nishio/RAKE実験1 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.