PDF

  • The concept of “concentration”
  • Number of sentences in which the string appears at least twice ÷ Number of sentences in which the string appears at least once
  • That is, the probability that a string occurs two or more times, conditional on it occurring at least once
    • If a string occurred according to a Poisson distribution, this value would match the frequency of the string.
    • Actual strings are distributed more widely than that.
    • Keywords chosen by humans are concentrated in one part of that distribution.
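The ratio described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `count_occurrences` and `concentration` are hypothetical, overlapping occurrences are counted, and a plain list of strings stands in for the corpus.

```python
def count_occurrences(sentence: str, s: str) -> int:
    """Count occurrences of s in sentence, allowing overlaps (an assumption)."""
    count = 0
    start = sentence.find(s)
    while start != -1:
        count += 1
        start = sentence.find(s, start + 1)
    return count


def concentration(sentences: list[str], s: str) -> float:
    """Sentences containing s two or more times, divided by sentences containing s at all."""
    counts = [count_occurrences(sentence, s) for sentence in sentences]
    df1 = sum(1 for c in counts if c >= 1)
    df2 = sum(1 for c in counts if c >= 2)
    return df2 / df1 if df1 else 0.0


sentences = ["abcabc", "abc", "xyz", "abxab"]
print(concentration(sentences, "abc"))  # 1 of the 2 sentences containing "abc" repeats it
```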

score

  • The “number of sentences in which the string appears once” is in essence DF, so dividing by it corresponds to IDF.
  • In tf-idf the numerator is TF, but here it is DF2, the number of sentences in which the string appears two or more times.
  • If DF2 is less than 3, the logarithmic score is set to -10000.
  • If the score exceeds 0.5, it is clipped to 0.5.
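As a rough sketch of these scoring rules: the function below takes the two document frequencies and returns the logarithmic score. The name `log_score`, the use of the natural logarithm, and the order in which the two rules are applied are assumptions; the note does not state them.

```python
import math


def log_score(df1: int, df2: int) -> float:
    """Logarithmic score from DF (df1) and DF2 (df2).

    df1: number of sentences containing the string at least once
    df2: number of sentences containing the string two or more times
    """
    if df2 < 3:
        # Too few repeated occurrences to trust the ratio.
        return -10000.0
    # df2 >= 3 implies df1 >= 3, so the division is safe.
    score = min(df2 / df1, 0.5)  # saturate at 0.5
    return math.log(score)  # natural log is an assumption


print(log_score(10, 2))  # -10000.0
print(log_score(10, 4))  # log(0.4)
print(log_score(4, 3))   # log(0.5), because 0.75 is clipped
```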

Applications: QuickSolution - Wikipedia

This can be interpreted as a function that takes a string as an argument and returns a “keyword-likeness” value.

  • Not “how keyword-like the string is as a representative of this document.”
  • It is “how keyword-like the string is in this language.”
  • This function must be obtained from a large corpus in advance
    • In what form should it be preserved?
    • A suffix array holds sufficient information, but is overly detailed.
      • Only the number of occurrences of the string within each sentence is needed, and it is enough to know whether that is 0, 1, or 2 or more.
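One possible compact form, sketched under assumptions: instead of a suffix array, keep two counters per substring, the number of sentences containing it at least once and the number containing it two or more times. The function name `build_table` and the length cap `max_len` are hypothetical, and enumerating every substring like this is only practical for a small corpus.

```python
from collections import Counter


def build_table(sentences: list[str], max_len: int = 4) -> tuple[Counter, Counter]:
    """Return (df1, df2) for every substring of length 1..max_len."""
    df1: Counter = Counter()  # sentences containing the substring at least once
    df2: Counter = Counter()  # sentences containing the substring two or more times
    for sentence in sentences:
        # Count every substring occurrence inside this one sentence.
        counts = Counter(
            sentence[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(sentence) - n + 1)
        )
        for sub, c in counts.items():
            df1[sub] += 1
            if c >= 2:
                df2[sub] += 1
    return df1, df2


df1, df2 = build_table(["abab", "ab", "ba"])
print(df1["ab"], df2["ab"])
```

Only the distinction between 0, 1, and 2 or more occurrences per sentence survives in this table, which is all the score needs.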

Reducing keyword extraction to the problem of word segmentation

  • The problem of finding the division that maximizes the product of the scores of each of the divided strings
    • Since every score is less than 1, a split with fewer pieces wins when the pieces score the same.
    • Why is this needed? → Because substrings of keywords also score high.
  • I think the structure would be mathematically similar to SentencePiece.
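The segmentation problem can be sketched as a dynamic program over split points. Maximizing the product of scores is the same as maximizing the sum of logarithmic scores, so the sketch works in log space. This is an illustration, not the original system: `segment`, `max_len`, and the toy score table are all hypothetical.

```python
from typing import Callable


def segment(text: str, log_score: Callable[[str], float], max_len: int = 10) -> list[str]:
    """Split text so that the sum of log scores of the pieces is maximal."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best total for text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cand = best[start] + log_score(text[start:end])
            if cand > best[end]:
                best[end] = cand
                back[end] = start
    # Walk the back-pointers to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Toy log scores (made up); unknown strings get the floor value.
table = {"key": -0.7, "word": -0.7, "keyword": -0.7}
toy = lambda s: table.get(s, -10000.0)
print(segment("keywordkey", toy))  # ['keyword', 'key']
```

Because every log score is negative, "keyword" as one piece (-0.7) beats "key" + "word" (-1.4), which is the preference for not splitting noted above.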

After splitting, keywords are those that satisfy the following conditions

  • Frequency ranges from 0.00005 to 0.1
  • Logarithmic score greater than -1
  • Length 2 or more
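The three conditions can be written as a simple predicate. This is a sketch under assumptions: the name `is_keyword` is hypothetical, the frequency bounds are treated as inclusive, and how "frequency" is normalized (for example, DF divided by the number of sentences) is not stated in the note.

```python
def is_keyword(piece: str, frequency: float, log_score: float) -> bool:
    """Apply the three post-segmentation filters to one piece."""
    return (
        0.00005 <= frequency <= 0.1  # neither too rare nor too common
        and log_score > -1           # sufficiently concentrated
        and len(piece) >= 2          # drop single characters
    )


print(is_keyword("keyword", 0.001, -0.7))  # True
print(is_keyword("k", 0.001, -0.7))        # False: too short
```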

Suffix Array


This page is auto-translated from /nishio/未踏テキスト情報中のキーワードの抽出システム開発 using DeepL. If something looks interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.