  • sequence labeling
    • Named entity extraction labels the “beginning of a named entity” and the “inside of a named entity”.
    • The RAKE stop list generation algorithm in keyphrase extraction counts the number of times a word is “in” or “next to” a keyphrase.
    • Mapped to sequence labeling, these would become the labels “inside a keyphrase” and “adjacent to a keyphrase”.
    • Personally, I think it would be better to distinguish between “right neighbor” and “left neighbor”.
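
The two labeling schemes above can be compared side by side. This is a minimal sketch, not taken from the original note: the function names, the label strings (`B`, `I`, `O`, `IN`, `L`, `R`), and the example sentence are all assumptions made for illustration. Keyphrases are given as token-index spans with an exclusive end.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def label_begin_inside(n_tokens: int, spans: List[Span]) -> List[str]:
    """Former scheme: 'B' begins a phrase, 'I' continues it, 'O' is outside."""
    labels = ["O"] * n_tokens
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def label_inside_neighbors(n_tokens: int, spans: List[Span]) -> List[str]:
    """Latter scheme: 'IN' inside a phrase, 'L' for the token just left of it,
    'R' for the token just right of it, 'O' otherwise."""
    labels = ["O"] * n_tokens
    for start, end in spans:
        # Mark neighbors first; a token already marked keeps its first label
        # (an arbitrary choice for this sketch).
        if start > 0 and labels[start - 1] == "O":
            labels[start - 1] = "L"
        if end < n_tokens and labels[end] == "O":
            labels[end] = "R"
    for start, end in spans:
        # Inside labels take priority over neighbor labels.
        for i in range(start, end):
            labels[i] = "IN"
    return labels


# Hypothetical example: the keyphrase "sequence labeling" is wrapped in brackets.
tokens = ["we", "use", "[", "sequence", "labeling", "]", "here"]
spans = [(3, 5)]

print(label_begin_inside(len(tokens), spans))
print(label_inside_neighbors(len(tokens), spans))
```

In the second scheme the bracket tokens themselves receive the `L` and `R` labels, so the cue that surrounds a keyphrase gets its own label instead of being lumped in with `O`.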

Which labeling is better?

  • For keyphrase extraction, I think the latter scheme, which labels the surroundings of a keyword, is more straightforward.
  • What do humans do when a keyword would be indistinguishable from the plain running text and they want to state explicitly that “this is a keyword”?
    • For example, they enclose it in brackets.
    • Conversely, all other conditions being equal, text enclosed in brackets is naturally more likely to be a keyphrase.
    • With the latter labeling, the opening-bracket token naturally corresponds to the “left of the keyword” label.
  • Why does named entity extraction often use the former labeling?
    • Named entities may appear back to back.
    • In that case labeling the surroundings does not work, because no non-entity token lies between the two entities to carry the boundary label.
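
The point about back-to-back named entities can be checked with a small round trip. This is a sketch under assumptions: the encoders, decoders, and label names below are hypothetical and only illustrate the argument. Two adjacent spans are encoded with each scheme and then decoded again.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def encode_bi(n: int, spans: List[Span]) -> List[str]:
    """Begin/inside labeling: 'B' at the first token, 'I' for the rest."""
    labels = ["O"] * n
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def decode_bi(labels: List[str]) -> List[Span]:
    """Recover spans: a 'B' always opens a new span, an 'O' closes the open one."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab in ("B", "O") and start is not None:
            spans.append((start, i))
            start = None
        if lab == "B":
            start = i
    if start is not None:
        spans.append((start, len(labels)))
    return spans


def encode_neighbors(n: int, spans: List[Span]) -> List[str]:
    """Surroundings labeling: 'L'/'R' on the neighbors, 'IN' inside."""
    labels = ["O"] * n
    for start, end in spans:
        if start > 0 and labels[start - 1] == "O":
            labels[start - 1] = "L"
        if end < n and labels[end] == "O":
            labels[end] = "R"
    for start, end in spans:
        for i in range(start, end):
            labels[i] = "IN"  # inside overrides any neighbor label
    return labels


def decode_neighbors(labels: List[str]) -> List[Span]:
    """Recover spans as maximal runs of 'IN'."""
    spans, start = [], None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes a trailing run
        if lab == "IN" and start is None:
            start = i
        elif lab != "IN" and start is not None:
            spans.append((start, i))
            start = None
    return spans


adjacent = [(0, 2), (2, 4)]  # two entities with no token between them
print(decode_bi(encode_bi(5, adjacent)))                # both spans recovered
print(decode_neighbors(encode_neighbors(5, adjacent)))  # merged into one span
```

With begin/inside labels the second `B` marks where the next entity starts, while the surroundings labels leave one unbroken run of `IN`, so the boundary between the two entities is lost.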

PS image

  • I was comparing 2-1 and 3, but there are finer-grained steps.

  • 4 can identify consecutive keywords.

  • 5 can distinguish “words that do not appear at the end of keywords but often appear within keywords”, such as “of”.

  • end labeling


This page is auto-translated from /nishio/固有表現抽出とキーフレーズ抽出 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.