Phrase-based TF-IDF: Application of Noun Phrase Analysis

- [[Phrase-based TF-IDF]] : Application of [NPP
- [[tf-idf]]
    - [[keyphrase extraction]]
    - [[Yugo Murawaki]]
- Abstract Recognition of key words in a document is a fundamental task for many applications. Such words are often not words but word sequences. However, state-of-the-art unsupervised methods score words by summing TF-IDFs and do not recognize the semantic cohesion of word sequences. In this paper, we propose a method to calculate TF-IDFs directly for phrases consisting of multiple words by applying the internal structure analysis of noun phrases, and investigate its behavior.

Conundrums in Unsupervised Keyphrase Extraction showed that word-based TF-IDF is stronger than methods such as [TextRank
- Word-based TF-IDF defines the phrase score as the sum of the TF-IDFs of the component words
- In addition to that, the heuristics of limiting to the longest noun phrase is unconsciously used
- This avoids the grammaticality problem.
unithood
- “the degree of strength or stability of syntagmatic combinations or collocations”
  - Degree to which the word sequence functions as a cohesive entity
termhood
- “the degree that a linguistic unit is related to (or more straightforwardly, represents) domain-specific concepts”
  - Degree to which a word sequence is associated with a particular concept.
The word-based TF-IDF is
- $tfidfW (w) = tf (w) \times l o g (D / D_{w})$
  - where tf is the frequency of occurrence of the word wi in sentences, D is the number of all sentences, and $D_{w_{i}}$ is the number of sentences containing the word wi
- $term (p) = \sum_{w \in p} tfidfW (w)$
  - p: phrase
- $unit (p) = (p \in longest (doc))? 1 : 0$
  - longest is the longest noun phrase
- $tfidf (p) = unit (p) \times term (p)$
For the definition of UNIT, only whether it is the longest noun phrase is used.
- If we were to include all the subword strings of the longest noun phrase
  - $unit (p) = (p \in all (doc))? 1 : 0$
  - Effective in raising the upper bound of Recall, but overall worsened accuracy.
  - This means that unit = 1 is not good, even for inappropriate partial noun phrases.
  - Then give the appropriate score for the partial noun phrase

This approach improves accuracy for longer sentences. Not improved for short sentences

https://ipsj.ixsq.nii.ac.jp/ej/?action=repository_action_common_download&item_id=95866&item_no=1&attribute_id=1&file_no=1

This page is auto-translated from [/nishio/フレーズベースTF-IDF: 名詞句解析の応用](https://scrapbox.io/nishio/フレーズベースTF-IDF: 名詞句解析の応用) using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thought to non-Japanese readers.

🪴 Quartz 4.0

Phrase-based TF-IDF: Application of Noun Phrase Analysis

Graph View

Backlinks