Cybozu Labs Machine Learning Study Group
- keyphrase extraction
- Extract several short strings of less than one sentence from an input string longer than one sentence
- This time around.
Word Frequency Approach
- Carve input into words
- Count the number of occurrences of each word.
- Score the ânumber of timesâ.
- This is called TF (Term Frequency)
- Showing the highest score
problem
- Frequent occurrence of words like âofâ and âwo.â
- This is a frequent occurrence in any text.
- I want to lower my score.
- Count the number of documents in which the word x appears
- Large DF words are âcommonâ words
- This is called DF(Document Frequency)
- So, the larger the DF, the smaller the score, which is TF-IDF
- (aside)
- It is interesting to see a characteristic distribution for key phrases (concentration (of oneâs attention)) when plotted on two axes with DF2, âthe number of documents in which the word x appears two or more times.
- Iâd take a log, but there doesnât seem to be much theoretical support for it.
- Affected by average number of words in âdocument,â etc.
- If you use books, is one document a âbookâ or a âpageâ?
- If itâs groupware, is it one post, one thread, one space?
general managersâ committee issue
- The term âGeneral Managerâs Meetingâ refers to a specific recurring event throughout this in Cybozu.
- But when chopped into words, they are divided into âbook,â âdirector,â and âassociation.â
- It gets mixed up with âbookâ meaning âbookâ and âmeetingâ meaning âmeetingâ of other events.
- Even if the word âdirectorâ was used with high frequency, it is difficult for a human being to think from it that you are talking about the General Managerâs Association.
- If it is to be shown to humans, it is necessary to extract âsequences of multiple wordsâ as key phrases
- âTextRank: Bringing Order into Textsâ
- If you chop them into words and treat them discretely, you lose information about word order, which is not good.
- Then, a graph is created with words as vertices and adjacencies as edges, which is multiplied by eigenvalue analysis to obtain a score for each word.
- Inspired by Googleâs PageRank, hence the name TextRank.
- Words in the top 1/3 of the score are concatenated if they occur consecutively in the source text
- problem
- Q: Wouldnât high frequency words score higher?
- A: Yes, so TextRank uses only nouns and adjectives (noun phrase approach).
Problems with the noun phrase approach
- Narrowing down the candidates too much
- Neither âintellectual production techniques of engineersâ nor â100 people, 100 different ways of workingâ can be key phrases.
- The âLean Startupâ is carved in middle black.
- Analysis of data from users who frequently use Scrapbox showed that 10% to 30% of key phrases contained non-noun adjectives â Scrapbox Statistics 2019-2.
-
Rapid Automatic Keyword Extraction
-
Same issues as Nishio.
- I want to take out âaxis of evilâ as a key phrase.
-
structure
- 1: First, divide by the words in the stop list
- 2: If key phrases appear in the same order, they are combined, including the words in between.
-
The key is how to make a stop list.
- TextRankâs noun phrase approach is equivalent to putting all but noun adjectives in the stop list
-
The paper tries two patterns six different ways.
- Pattern TF:.
- The idea that words that occur frequently are stop targets.
- Use a lot of word DFs.
- Pattern KA:.
- The idea is that words that appear next to key phrases and do not appear easily in key phrases are stop targets.
- If the number of occurrences in a key phrase is greater than the number of occurrences next to the key phrase, exclude it from the stop list
- Pattern TF:.
-
result
- KA is good all around.
- In TF, increasing the number of stoplists decreases performance, in KA it increases it.
- With KA, âwords that also occur frequently in key phrasesâ donât make the stop list, so they can be increased.
benchmark
- Speed is significantly faster than TextRank
- TextRank is heavy on solving large matrix eigenvalue problems.
- Accuracy is best in both F value and percentage of fit.
- The conformance rate is the percentage of key phrases that the system determines to be key phrases that are actually key phrases.
- TextRank methods have evolved in various ways since then: PositionRank, EmbedRank, and so on.
supplementary examination
- Assume that the key phrase is the range enclosed by the Wikipedia link notation
[[ ]]
. - Generate stop list with 1000/2000 pages
- Key phrase extraction against 1000 other pages
- Example: ampersand
- INPUT
ampersand (&) is a symbol for the juxtaposition of the particles "...and...". It is a ligature of the Latin "et" and is easily recognized in the Trebuchet MS font as <FILE>. ampersa, "and per se and", means "and [the symbol which] by itself [is] and". ...
- EXPECTED
['symbol', 'Latin', 'ligature', 'Trebuchet MS', ...
- OUTPUT
['symbol which] by itself [is] and', 'ampersand (&, English name', 'and per se and', 'Trebuchet MS font', 'and [', 'ampersand', 'parallel particle', 'ampersa', 'Latin ', 'FILE', 'understand', 'symbol', 'display', 'meaning', 'ligature', 'et', ...
Precision: 0.208, Recall: 0.808, F-measure: 0.331
- INPUT
- Precision 0.32+-0.30 for 1000 pages
- Roughly similar results to the paper were reproduced.
- Large standard deviation
- It may be due to the fact that the abstracts of the papers are generally of similar length, whereas the random pages on Wikipedia are of varying lengths.
- small talk
- Iâm taking the top 1/3 of the paper.
- Normally, Precision goes down as you increase it.
- In this experiment, it went up the other way.
- High scoring key phrases have a high probability of not being correct.
Comparison
- TextRank
- Stop list: non-noun adjectives
- Combining: if high scoring adjacent to each other
- RAKE
- Stop list: compare inside and adjacent to key phrases
- Combining: if they appear in the same order
Stop List Considerations
- Stop list can be regarded as a âstring â 0/1â function
- What is âchopping at the stop word?â
- The red part is 1-p and then âthe product is 1.â
- Consistent with the comparison of ânumber of occurrences within a keywordâ and ânumber of occurrences next to a keyword.â
- Is 0/1 sufficient?
- âsâ and so on, in, but also adjacent to.
- Unknown words are not on the stop list.
- You donât want to have âtokens that shouldnât be in keywordsâ that appear infrequently.
- For example, if a URL is mixed in with the input :
f noun,general,*,*,*,*,*
3 noun,number,*,*,*,*,*
b noun,general,*,*,*,*,*
1 noun,number,*,*,*,*,*
af noun,general,*,*,*,*,*
2 noun,number,*,*,*,*,*
daee noun,general,*,*,*,*,*
8 noun,number,*,*,*,*,*
e noun,general,*,*,*,*,*
942 noun,number,*,*,*,*,*
b noun,general,*,*,*,*,*
49 noun,number,*,*,*,*,*
adbb noun,general,*,*,*,*,*
2 noun,number,*,*,*,*,*
e noun,general,*,*,*,*,*
40 noun,number,*,*,*,*,*
f noun,general,*,*,*,*,*
0 noun,number,*,*,*,*,*
c noun,general,*,*,*,*,*
6746 noun,number,*,*,*,*,*
f noun, proper noun, organization,*,*,*,*
Proposal to make it a stochastic model
- future work
- CRF
- Commonly used in morphological analysis itself and other series labeling
- âTwo or more occurrences of a key phrase sequenceâ is a global feature not used in CRFs, etc.
- How do I make this work?
Regarding keyphrase binding
- Now they donât care what kind or how many tokens they have, theyâre combining them anyway.
'Make a presentation \u3000\u3000â \u3000 (immediately after the action'
)- They are combined because they appear multiple times, but I donât want full-width spaces in the key phrase.
- To begin with, the newlines are gone at the MeCab stage, but shouldnât the newlines be tokens that never enter the keyphrase?
- of tokens inserted between them âprobability of appearing in a key phrase.â
This page is auto-translated from /nishio/ăăŒăăŹăŒășæœćș2020-08 using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. Iâm very happy to spread my thought to non-Japanese readers.