- [[Phrase-based TF-IDF]]: Application of Noun Phrase Analysis
- [[tf-idf]]
    - [[keyphrase extraction]]
    - [[Yugo Murawaki]]
- Abstract: Recognizing the keywords of a document is a fundamental task for many applications. Such keywords are often not single words but word sequences. However, state-of-the-art unsupervised methods score word sequences by summing word-level TF-IDFs and do not capture the semantic cohesion of word sequences. In this paper, we propose a method that calculates TF-IDF directly for multi-word phrases by applying an analysis of the internal structure of noun phrases, and we investigate its behavior.
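As a rough illustration of what computing TF-IDF directly for a phrase could look like, the following sketch treats a phrase as a single unit and counts its contiguous occurrences. It is a simplification under stated assumptions: whitespace tokenization, exact word-sequence matching, and idf = log(D / df) are all assumed here, and the noun-phrase structure analysis that the paper applies is not reproduced. The names `count_phrase` and `phrase_tfidf` are hypothetical.

```python
import math


def count_phrase(tokens, phrase_tokens):
    """Count contiguous occurrences of phrase_tokens inside tokens."""
    n = len(phrase_tokens)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens
    )


def phrase_tfidf(phrase, docs, target_index):
    """TF-IDF of a phrase treated as one unit (hypothetical helper).

    Assumption: idf = log(D / df), with df counted at the phrase level.
    """
    tokenized = [doc.lower().split() for doc in docs]
    phrase_tokens = phrase.lower().split()
    counts = [count_phrase(tokens, phrase_tokens) for tokens in tokenized]
    df = sum(1 for c in counts if c > 0)  # documents containing the phrase
    if df == 0:
        return 0.0
    return counts[target_index] * math.log(len(docs) / df)


# Toy corpus, invented for illustration only.
docs = [
    "machine translation improves when machine translation data grows",
    "machine learning needs data",
    "translation of poetry is hard",
]
score_phrase = phrase_tfidf("machine translation", docs, 0)
score_other = phrase_tfidf("machine learning", docs, 0)
```

In this toy corpus the words "machine" and "translation" each occur in two documents, but the phrase "machine translation" occurs only in the first one, so its phrase-level document frequency is 1.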
- Conundrums in Unsupervised Keyphrase Extraction showed that word-based TF-IDF is stronger than methods such as [[TextRank]].
    - Word-based TF-IDF defines the phrase score as the sum of the TF-IDFs of the component words.
    - In addition, the heuristic of restricting candidates to the longest noun phrases is used implicitly.
        - This avoids the grammaticality problem.
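The word-based scoring described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: whitespace tokenization and idf = log(D / df) are assumptions, the helper names `word_tfidf` and `phrase_score` are hypothetical, and the candidate phrases are listed by hand where a real system would take the longest noun phrases from a chunker.

```python
import math
from collections import Counter


def word_tfidf(docs, target_index):
    """Return {word: tf * log(D / df)} for the document at target_index."""
    tokenized = [doc.lower().split() for doc in docs]
    num_docs = len(tokenized)
    # df: number of documents that contain each word at least once
    df = Counter(word for tokens in tokenized for word in set(tokens))
    tf = Counter(tokenized[target_index])
    return {w: tf[w] * math.log(num_docs / df[w]) for w in tf}


def phrase_score(phrase, tfidf):
    """Word-based phrase score: the sum of the TF-IDFs of the component words."""
    return sum(tfidf.get(w, 0.0) for w in phrase.lower().split())


# Toy corpus, invented for illustration only.
docs = [
    "keyphrase extraction ranks noun phrases",
    "noun phrases are word sequences",
    "graph ranking methods score words",
]
tfidf = word_tfidf(docs, 0)

# Candidates are given by hand (assumption: they are longest noun phrases).
candidates = ["keyphrase extraction", "noun phrases"]
scores = {p: phrase_score(p, tfidf) for p in candidates}
```

Because the score is a plain sum over words, a phrase made of individually rare words ranks high whether or not the words form a cohesive unit.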
 
- unithood: “the degree of strength or stability of syntagmatic combinations or collocations”
    - The degree to which a word sequence functions as a cohesive unit.
- termhood: “the degree that a linguistic unit is related to (or more straightforwardly, represents) domain-specific concepts”
    - The degree to which a word sequence is associated with a particular concept.
- The word-based TF-IDF of a word w_i is built from three quantities: tf, the frequency of occurrence of w_i in the document; D, the number of all documents; and df, the number of documents containing w_i.
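For reference, a standard form of TF-IDF that is consistent with the quantities named above is written out below, together with the word-based phrase score stated earlier (the sum over component words). This is an assumed reconstruction, not a formula taken from the paper: the symbol df(w_i) for the number of items containing w_i, the logarithm, and the absence of smoothing are all assumptions.

```latex
\[
\mathrm{tfidf}(w_i) = \mathrm{tf}(w_i) \cdot \log \frac{D}{\mathrm{df}(w_i)},
\qquad
\mathrm{score}(p) = \sum_{w_i \in p} \mathrm{tfidf}(w_i)
\]
```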
 
- Notation used in the phrase scores:
    - p: a phrase
    - longest: the longest noun phrase
- For the definition of unit, only whether the phrase is the longest noun phrase is used.
- If all partial word sequences of the longest noun phrase were also included as candidates:
    - This is effective in raising the upper bound on recall, but it worsens overall accuracy.
    - This means that setting unit = 1 even for inappropriate partial noun phrases is not good.
    - Partial noun phrases should therefore be given an appropriate score.
- This approach improves accuracy on longer texts but not on short ones.
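To make the candidate expansion discussed above concrete, the sketch below enumerates every contiguous partial word sequence of a longest noun phrase and applies a unit function that is 1 only for the longest phrase itself. The function names `partial_phrases` and `unit_longest_only` and the example phrase are hypothetical, and the notes do not specify what a graded score for partial noun phrases looks like, so none is attempted here.

```python
def partial_phrases(longest_phrase):
    """All contiguous word sequences of the phrase, including the phrase itself."""
    words = longest_phrase.split()
    n = len(words)
    return [" ".join(words[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


def unit_longest_only(candidate, longest_phrase):
    """unit = 1 only when the candidate is the longest noun phrase, else 0."""
    return 1.0 if candidate == longest_phrase else 0.0


# Example phrase invented for illustration only.
longest = "unsupervised keyphrase extraction method"
candidates = partial_phrases(longest)
units = {c: unit_longest_only(c, longest) for c in candidates}
```

A phrase of n words yields n(n + 1) / 2 candidates, which is why including all of them raises the recall ceiling while also admitting many inappropriate partial phrases.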
This page is auto-translated from [/nishio/フレーズベースTF-IDF: 名詞句解析の応用](https://scrapbox.io/nishio/フレーズベースTF-IDF: 名詞句解析の応用) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.