- [[Phrase-based TF-IDF]] : Application of [NPP
- [[tf-idf]]
- [[keyphrase extraction]]
- [[Yugo Murawaki]]
- Abstract: Recognizing keywords in a document is a fundamental task for many applications. Such keywords are often not single words but word sequences. However, state-of-the-art unsupervised methods score phrases by summing the TF-IDFs of their words and do not capture the semantic cohesion of word sequences. In this paper, we propose a method that computes TF-IDF directly for multi-word phrases by applying internal structure analysis of noun phrases, and we investigate its behavior.
-
"Conundrums in Unsupervised Keyphrase Extraction" showed that word-based TF-IDF is stronger than methods such as TextRank
- Word-based TF-IDF defines the phrase score as the sum of the TF-IDFs of the component words
- In addition, the heuristic of restricting candidates to the longest noun phrases is used implicitly
- This avoids the grammaticality problem.
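
The word-based scoring described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the standard `tf * log(D / df)` form of TF-IDF, assumes the longest noun phrases are already available as tuples of words (for example from a chunker, which is not shown), and uses hypothetical function names and a made-up toy corpus.

```python
import math
from collections import Counter


def word_tfidf(doc_tokens, corpus):
    """TF-IDF per word of one document: tf * log(D / df) (assumed standard form)."""
    tf = Counter(doc_tokens)
    doc_sets = [set(d) for d in corpus]
    num_docs = len(doc_sets)
    scores = {}
    for word, freq in tf.items():
        # df: number of documents in the corpus that contain the word
        df = sum(1 for s in doc_sets if word in s)
        scores[word] = freq * math.log(num_docs / df) if df else 0.0
    return scores


def phrase_score(phrase, tfidf):
    """Word-based phrase score: the sum of the TF-IDFs of the component words."""
    return sum(tfidf.get(w, 0.0) for w in phrase)


def rank_phrases(longest_noun_phrases, tfidf):
    """Rank candidates; only the longest noun phrases are passed in (the heuristic)."""
    return sorted(longest_noun_phrases,
                  key=lambda p: phrase_score(p, tfidf),
                  reverse=True)


# Toy corpus (made-up data); the first document is the one being analysed.
corpus = [
    ["phrase", "based", "tf", "idf", "phrase"],
    ["keyphrase", "extraction", "method"],
    ["tf", "idf", "method"],
]
tfidf = word_tfidf(corpus[0], corpus)
candidates = [("phrase", "based"), ("tf", "idf")]
ranked = rank_phrases(candidates, tfidf)
```

Note that the sum makes longer phrases score higher simply because they contain more words, which is one reason the cohesion of the word sequence is not reflected in this score.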
-
unithood
- “the degree of strength or stability of syntagmatic combinations or collocations”
- Degree to which the word sequence functions as a cohesive entity
-
termhood
- “the degree that a linguistic unit is related to (or more straightforwardly, represents) domain-specific concepts”
- Degree to which a word sequence is associated with a particular concept.
-
The word-based TF-IDF score
- Here tf is the frequency of the word w_i in the document, D is the total number of documents, and df is the number of documents containing w_i
- p: the phrase being scored
- longest: the longest noun phrases
-
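
One way to write the definitions above as formulas is shown here. This is a sketch under assumptions: it uses the standard logarithmic IDF (the paper's exact variant, such as log base or smoothing, may differ), the symbol df for the document count is assumed, and unit is assumed to act as a multiplicative factor on the summed TF-IDF.

```latex
\begin{align}
\mathrm{tfidf}(w_i) &= \mathrm{tf}(w_i) \cdot \log \frac{D}{\mathrm{df}(w_i)} \\
\mathrm{score}(p) &= \mathrm{unit}(p) \cdot \sum_{w_i \in p} \mathrm{tfidf}(w_i) \\
\mathrm{unit}(p) &=
  \begin{cases}
    1 & \text{if } p \in \mathit{longest} \\
    0 & \text{otherwise}
  \end{cases}
\end{align}
```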
The definition of unit uses only whether the phrase is a longest noun phrase.
- Suppose we instead include every contiguous word subsequence of each longest noun phrase as a candidate
- This is effective in raising the upper bound on recall, but it worsens overall accuracy.
- In other words, assigning unit = 1 to every partial noun phrase, including inappropriate ones, does not work well.
- The remedy is to give each partial noun phrase an appropriate score instead
-
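
The candidate expansion discussed above can be illustrated with a short sketch. It assumes that the partial phrases are the contiguous word subsequences of a longest noun phrase and that unit multiplies the summed TF-IDF. All function names and TF-IDF values are hypothetical, and the score the paper actually assigns to partial noun phrases is not reproduced here, so `unit` is left as a pluggable function.

```python
def sub_phrases(longest_np):
    """Return every contiguous word subsequence of a longest noun phrase."""
    words = tuple(longest_np)
    n = len(words)
    return [words[i:j] for i in range(n) for j in range(i + 1, n + 1)]


def unit_longest_only(phrase, longest_set):
    """Baseline: unit is 1 only for the longest noun phrases themselves."""
    return 1.0 if phrase in longest_set else 0.0


def unit_all_ones(phrase, longest_set):
    """Naive expansion: every partial phrase gets unit = 1."""
    return 1.0


def score(phrase, tfidf, unit, longest_set):
    """Assumed form: unit(p) times the sum of component-word TF-IDFs."""
    return unit(phrase, longest_set) * sum(tfidf.get(w, 0.0) for w in phrase)


longest = ("noun", "phrase", "analysis")
longest_set = {longest}
tfidf = {"noun": 1.0, "phrase": 2.0, "analysis": 0.5}  # made-up values

candidates = sub_phrases(longest)
baseline = {p: score(p, tfidf, unit_longest_only, longest_set) for p in candidates}
expanded = {p: score(p, tfidf, unit_all_ones, longest_set) for p in candidates}
```

With `unit_all_ones`, fragments such as `("phrase", "analysis")` receive a nonzero score purely from their word TF-IDFs, which is the behaviour the notes describe as hurting accuracy. A graded `unit` function would replace it.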
This approach improves accuracy for longer texts but not for short ones.
This page is auto-translated from [/nishio/フレーズベースTF-IDF: 名詞句解析の応用](https://scrapbox.io/nishio/フレーズベースTF-IDF: 名詞句解析の応用) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.