- The concept of concentration
- Number of sentences in which the string appears two or more times ÷ number of sentences in which the string appears at least once
- Probability of two or more occurrences of a string conditional on the string occurring once
- If a string's occurrences followed a Poisson distribution, this value would match the string's frequency.
- Actual substrings are spread over a much wider range of values.
- Human-chosen keywords are concentrated in one part of that range.
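The ratio described above can be sketched in a few lines. This is only an illustration: the function names and the toy data are hypothetical, and occurrences are counted with `str.count`, which counts non-overlapping matches.

```python
def sentence_frequencies(sentences, s):
    """Return (df1, df2): sentences containing s at least once / at least twice."""
    df1 = df2 = 0
    for sentence in sentences:
        n = sentence.count(s)  # non-overlapping occurrences of s
        if n >= 1:
            df1 += 1
        if n >= 2:
            df2 += 1
    return df1, df2


def repetition_ratio(sentences, s):
    """Estimate P(two or more occurrences | at least one occurrence)."""
    df1, df2 = sentence_frequencies(sentences, s)
    return df2 / df1 if df1 else 0.0


# Toy data, invented for this example.
toy = ["the cat sat on the cat mat", "a cat", "dog dog dog", "no match"]
ratio_cat = repetition_ratio(toy, "cat")
```

Here "cat" appears in two sentences and is repeated in one of them, so the ratio is 1/2.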
Score
- The “number of sentences in which the string appears once” is essentially DF, so dividing by it corresponds to IDF.
- In tf-idf the numerator is TF, but here it is DF2, the number of sentences in which the string appears twice or more.
- If DF2 is less than 3, the logarithmic score is set to -10000.
- If the score exceeds 0.5, it is clamped to 0.5.
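A minimal sketch of the scoring rules above. The notes do not state the base of the logarithm, so the natural logarithm is an assumption here, and the names `log_score`, `MIN_DF2`, `MAX_SCORE`, and `LOW_LOG_SCORE` are hypothetical.

```python
import math

LOW_LOG_SCORE = -10000.0  # value used when DF2 is too small to trust
MAX_SCORE = 0.5           # saturation level for the raw score
MIN_DF2 = 3               # DF2 below this gets LOW_LOG_SCORE


def log_score(df1, df2):
    """Log of DF2 / DF1 with the small-count floor and the 0.5 cap.

    Assumption: natural logarithm (the base is not given in the notes).
    """
    if df2 < MIN_DF2:
        return LOW_LOG_SCORE
    score = min(df2 / df1, MAX_SCORE)
    return math.log(score)
```

Because DF2 can never exceed DF1, passing the `MIN_DF2` check also guarantees that `df1` is nonzero.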
Applications: QuickSolution (Wikipedia)
This can be interpreted as a function that takes a string as an argument and returns a “keyword-likeness” value.
- Not “how keyword-like the string is as a representative of this document.”
- Rather, “how keyword-like the string is in this language.”
- This function must be obtained from a large corpus in advance.
- In what form should it be stored?
- A suffix array contains sufficient information, but is more detailed than necessary.
- Only the number of occurrences in different sentences is needed; it is enough to know whether it is 0, 1, or 2 or more.
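One possible reading of the storage question, sketched under assumptions: for each substring, keep only two counters, the number of sentences where it occurs at least once and the number where it occurs twice or more. The function name `build_df_table` and the `max_len` limit are hypothetical, and occurrences are counted with overlap here.

```python
from collections import Counter, defaultdict


def build_df_table(sentences, max_len=4):
    """Map each substring (up to max_len chars) to [df1, df2].

    Per sentence, only "0, 1, or 2 or more occurrences" matters,
    so the exact per-sentence counts are discarded.
    """
    table = defaultdict(lambda: [0, 0])
    for sentence in sentences:
        counts = Counter(
            sentence[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(sentence) - n + 1)
        )
        for s, c in counts.items():
            table[s][0] += 1      # occurs at least once in this sentence
            if c >= 2:
                table[s][1] += 1  # occurs twice or more in this sentence
    return dict(table)


# Toy data, invented for this example.
df_table = build_df_table(["abab", "ab", "cd"])
```

This table is far smaller than a suffix array over the same corpus, at the cost of fixing a maximum substring length in advance.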
Reducing keyword extraction to the problem of word segmentation
- The problem is to find the segmentation that maximizes the product of the scores of the resulting strings.
- Since each score is less than 1, it is better not to split when the scores are the same.
- Why? Because substrings of keywords also score high.
- I think the structure would be mathematically similar to SentencePiece.
- See the SentencePiece unigram language model.
- The “keyword-likeness” score is read as “word-likeness.”
- Find the maximum-likelihood word segmentation using the Viterbi algorithm.
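The segmentation step can be sketched as a standard Viterbi-style dynamic program over split positions. Maximizing the product of scores is done here by maximizing the sum of log scores. The names `viterbi_segment`, `max_len`, and `toy_log_score` are hypothetical, and the toy scores are invented for illustration.

```python
import math


def viterbi_segment(text, log_score, max_len=8):
    """Split text into pieces that maximize the sum of log_score(piece).

    log_score is any function from a string to a finite log score.
    Pieces are at most max_len characters long.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total for text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cand = best[start] + log_score(text[start:end])
            if cand > best[end]:
                best[end] = cand
                back[end] = start
    # Walk the back pointers from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    pieces.reverse()
    return pieces


def toy_log_score(s):
    # Invented scores: known strings get log(0.5), everything else -10000.
    known = {"key", "word", "keyword"}
    return math.log(0.5) if s in known else -10000.0


segmentation = viterbi_segment("keyword", toy_log_score)
```

With these toy scores, "keyword" stays whole: one piece costs log 0.5, while "key" plus "word" costs twice that, which is the "better not to split" effect.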
After splitting, keywords are the pieces that satisfy the following conditions:
- Frequency between 0.00005 and 0.1
- Logarithmic score greater than -1
- Length 2 or more
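The three conditions can be written as a single predicate. This is a sketch: the notes do not say how frequency is measured or whether the range bounds are inclusive, so the caller supplies `frequency` and inclusive bounds are assumed. The name `is_keyword` is hypothetical.

```python
MIN_FREQUENCY = 0.00005
MAX_FREQUENCY = 0.1
MIN_LOG_SCORE = -1.0
MIN_LENGTH = 2


def is_keyword(piece, frequency, log_score):
    """Apply the frequency, log-score, and length conditions to one piece.

    Assumption: the frequency bounds are inclusive.
    """
    return (
        MIN_FREQUENCY <= frequency <= MAX_FREQUENCY
        and log_score > MIN_LOG_SCORE
        and len(piece) >= MIN_LENGTH
    )
```

For example, a piece whose logarithmic score was floored to -10000 is rejected regardless of its frequency or length.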
This page is auto-translated from /nishio/未踏テキスト情報中のキーワードの抽出システム開発 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.