from BERTopic: Neural topic modeling with a class-based TF-IDF Class-Based TF-IDF Class-based TF-IDF is a new topic representation method proposed in BERTopic.
Normal TF-IDF calculates the importance of a word in each document. On the other hand, class-based TF-IDF groups documents by topic and considers them as one pseudo document, and calculates the importance of words in that document class.
Specifically, it is calculated as follows
Wt,c = tft,c · log(1 + A/tft)
- tft,c : frequency of occurrence of the word t in documents contained in topic c
- A : Average frequency of words in the corpus
- tft : Frequency of occurrence of word t in the whole corpus
While regular TF-IDF treats each document independently, class-based TF-IDF is unique in that it groups documents by topic. This allows for a more direct calculation of the importance of words that characterize a topic.
BERTopic states that using this class-based TF-IDF for generating topic representations has yielded better results than traditional cluster-centric-based topic representations. It can be said that this method contributes to improving the interpretability of topics.
This page is auto-translated from /nishio/クラスベースTF-IDF using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thought to non-Japanese readers.