Cybozu Labs Machine Learning Study Session 2019-04-05
- We'll look back at what we've done so far, then talk about current issues and fuzzy search.
- We've done quite a lot of things, so let's first review and sort them out.
- Pointwise estimation by KyTea
- maximal substring approach
- SentencePiece
- TextRank
- Pointwise estimation by KyTea
KyTea
- A system that segments a string into words using pointwise estimation of whether each position between characters is a word boundary (a toy sketch appears at the end of this section).
- I tinkered with this around January 2015.
- I was wondering if this could be used to estimate the boundaries of key phrases
- In the course of this work, I came to understand the concept of Active Learning (2015-01-30).
- 2015/03/11
- I was writing about keyphrase extraction with KyTea for Friday's machine learning study session, and while writing I got the feeling that I should be doing dependency parsing…
- After this, Kaggle got exciting and the work was pending for a time.
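As a toy illustration of the pointwise idea (my own minimal sketch, not KyTea's actual implementation; the scikit-learn setup and the tiny training set are assumptions for the example), each gap between characters is classified as boundary or not from the surrounding characters:

```python
# Minimal sketch of pointwise word-boundary estimation in the spirit of
# KyTea (NOT KyTea itself). Each gap between characters is classified
# independently from a small window of surrounding characters.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def gap_features(text, i, window=2):
    """Character features around the gap between text[i-1] and text[i]."""
    return {f"c{j}:{text[i + j]}": 1
            for j in range(-window, window)
            if 0 <= i + j < len(text)}

def make_training_data(segmented):
    """segmented: list of word lists, e.g. [["my", "cat"]]."""
    X, y = [], []
    for words in segmented:
        text = "".join(words)
        cuts, pos = set(), 0
        for w in words[:-1]:
            pos += len(w)
            cuts.add(pos)
        for i in range(1, len(text)):
            X.append(gap_features(text, i))
            y.append(1 if i in cuts else 0)
    return X, y

X, y = make_training_data([["my", "cat"], ["my", "hat"], ["a", "cat"]])
vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X), y)

text = "myhat"
print([i for i in range(1, len(text))
       if clf.predict(vec.transform([gap_features(text, i)]))[0]])
# each printed index is a predicted word boundary; ideally [2] -> "my" + "hat"
```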
Review of April 2017
I tried using KyTea to do keyword extraction that is not bound to MeCab's word-by-word units. This approach of extracting keywords with KyTea was not the right one.
There was the problem of preparing enough training data. I think it would have been better to form keywords by applying rule-based "word concatenation" on top of MeCab's morphological analysis. The "chop with MeCab, then take maximal substrings" approach I'm working on now is similar to that.
Maximal substring
- An approach that applies the idea of maximal substrings to a sequence of words rather than a sequence of characters.
- Maximal substring = a substring such that no longer string containing it occurs the same number of times.
- There are algorithms that can extract substrings that appear repeatedly.
- Suffix array - Wikipedia (SA-IS method)
- All substrings occurring N or more times, and the like, can be enumerated efficiently.
- → A key phrase is a sequence of words that appears repeatedly, so this should be usable (a brute-force sketch follows this list).
- I was doing this around April 2017.
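A brute-force sketch of the definition above applied to word sequences (my illustration; a real implementation would enumerate these efficiently with a suffix array, as linked above):

```python
# Brute-force sketch of maximal substrings over a word sequence.
# (A real implementation would use a suffix array; this shows the idea.)
from collections import Counter

def repeated_ngrams(words, n_min=2, max_len=6):
    """All word n-grams (up to max_len words) occurring at least n_min times."""
    counts = Counter(tuple(words[i:i + k])
                     for k in range(1, max_len + 1)
                     for i in range(len(words) - k + 1))
    return {g: c for g, c in counts.items() if c >= n_min}

def contains(big, small):
    """True if tuple `small` occurs contiguously inside tuple `big`."""
    return any(big[i:i + len(small)] == small
               for i in range(len(big) - len(small) + 1))

def maximal(counts):
    """Drop any n-gram contained in a longer n-gram with the same count."""
    return {g: c for g, c in counts.items()
            if not any(len(g2) > len(g) and c2 == c and contains(g2, g)
                       for g2, c2 in counts.items())}

words = "the quick brown fox sees the quick brown cat".split()
print(maximal(repeated_ngrams(words)))
# {('the', 'quick', 'brown'): 2} -- "the quick" etc. are not maximal
```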
result
- I did indeed extract the long strings of words I was hoping for.
- But a lot of other things were extracted as well.
- Used Active Learning to train a "does this look like a key phrase?" filter.
- Results of the experiment around April 2017
- Higher scores (excluding training data)
  - "biological" (37 times, 88.0%), "synthetic inhibitor" (20 times, 88.0%), "medial temporal lobe" (28 times, 88.0%)
- Lower scores
  - "will be (figure)" (5 times, 14.7%), "after" (6 times, 14.4%), "if" (5 times, 14.4%)
- After this, the Deep Learning study sessions got so exciting that this was put on hold for a while.
Monday, May 1, 2017 17:32 Experiment with keyphrase extraction into Scrapbox
Summary of experiments with each page of the book as the target unit
- Chapter headings tend to be extracted as frequent key phrases
- It is interesting that key phrases contained in chapter headings sometimes appear together in places other than that chapter
- Maybe appearances on consecutive pages can be merged into one
- It is bad that even if you click the link and jump to the target page, you cannot read its text
- Technically it is possible to embed the text there, or embed images of the page, but for copyright reasons that cannot be published
  → It can be tried with the "Presentation Slide PDF", whose copyright I own
- "All appearances on successive pages are combined into one."
- 2020 postscript: this is page-level DF (document frequency); a small sketch follows below.
- Not very useful, since it's obvious that such phrases appear repeatedly (foreshadowing).
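A small sketch of page-level DF with the consecutive-page merge (the data here is hypothetical):

```python
# Page-level DF sketch: on how many pages does each phrase appear,
# with appearances on consecutive pages merged into one.
def page_df(pages):
    """pages: list of sets of phrases, one set per page, in page order."""
    df = {}
    for i, phrases in enumerate(pages):
        for p in phrases:
            if i > 0 and p in pages[i - 1]:
                continue  # continuation of the previous page's appearance
            df[p] = df.get(p, 0) + 1
    return df

pages = [{"chapter heading", "KJ method"}, {"chapter heading"}, {"KJ method"}]
print(page_df(pages))  # {'chapter heading': 1, 'KJ method': 2}
```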
word segmentation
- I used MeCab to split the words beforehand.
- I fiddled with the processing so that key phrases containing whitespace can have the whitespace restored correctly.
- The question of whether the word-segmentation approach is the right one to begin with…
- The Problem of Unknown Words
- Language model-based tokenizer
- Reversible Text Segmentation
- Eliminates the problem of unknown words.
- The linguistic concept of a "word" is irrelevant.
- Use a suffix array to extract the top 1 million frequent substrings, then reduce the vocabulary (a minimal usage sketch follows this list).
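A minimal usage sketch (assumed standard `sentencepiece` Python API, not the actual script used; "corpus.txt" is a hypothetical input file):

```python
# Train a SentencePiece model with a 10000-token vocabulary, as in the
# 2018-12-18 experiment below, then segment text reversibly.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",   # raw text, one sentence per line (hypothetical)
    model_prefix="kp",    # writes kp.model / kp.vocab
    vocab_size=10000,
)

sp = spm.SentencePieceProcessor(model_file="kp.model")
pieces = sp.encode("a sentence from the corpus", out_type=str)
print(pieces)             # multi-character units learned as single tokens
print(sp.decode(pieces))  # reversible segmentation: restores the input text
```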
- 2018/12/18
- When I fed in the corpus of "The Intellectual Production Techniques of Engineers" (about 1700 distinct characters) and had it segmented with a 10000-token vocabulary, it was nice that it recognized "The Intellectual Production Techniques of Engineers" as a single unit, and so on.
- But what comes out is of course "10000 tokens that can express that text", so it takes some effort to pick out from them the key phrases that are interesting to humans.
- Based on, valid, based on, output example, paste, room cleanup, too little, inexperienced, coding, supervise, shrink, history, finally, radar chart, continuous, Dutch, move near related, compare, top down, one dimensional reading experience, → Bacon, double fast, wall, mountain, gentle, refactoring, photos, handy, technical review, professor, mature, pull out, seen in chapter, display code, SMART, gather all first
- After this, I started working on the English version of "The Intellectual Production Techniques of Engineers", and the work was pending again
- April 2019 (i.e., now)
- PositionRank (2017 paper)
- Apply the Google PageRank algorithm to a graph built from the word sequence.
- Up to this point, the same as the earlier work TextRank (2004)
- The paper reports that adding a bias based on word-occurrence position to this beats TF-IDF-style keyphrase extraction.
- I wasn't sure why it works, so I started by implementing TextRank.
- Build a graph from word adjacency and run PageRank (a sketch follows at the end of this section)
- Create a dense matrix of transition probabilities and do an eigenvalue decomposition
- It takes a little over 1 second for 1000 vertices (MacBook)
- Won't words that are adjacent to many different words naturally get a higher rank?
- → Right; done naively, particles like "wa" (は) rank high.
- The original paper filters out everything except nouns and adjectives
- → Subtle, since Scrapbox Statistics 2019-2 shows that around 10-30% of key phrases in use contain something other than nouns and adjectives.
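My minimal sketch of the TextRank part (power iteration here for simplicity, instead of the eigenvalue decomposition mentioned above; the toy text and parameters are my own):

```python
# TextRank sketch: build a graph from adjacent words, then PageRank it
# via power iteration on a dense transition-probability matrix.
import numpy as np

def textrank(words, damping=0.85, iters=50):
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    adj = np.zeros((n, n))
    for a, b in zip(words, words[1:]):  # undirected edge per adjacent pair
        adj[index[a], index[b]] = adj[index[b], index[a]] = 1.0
    deg = adj.sum(axis=1, keepdims=True)
    # rows -> transition probabilities; isolated words get uniform rows
    trans = np.divide(adj, deg, out=np.full_like(adj, 1.0 / n), where=deg > 0)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * trans.T @ rank
    return dict(zip(vocab, rank))

words = "lean startup is a lean method for startup teams".split()
scores = textrank(words)
print(sorted(scores, key=scores.get, reverse=True)[:3])
```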
TextRank: what about multi-word key phrases?
- This is, in the end, a "score the words in some way" method
- PageRank is used as that "some way".
- So how do you find key phrases that consist of multiple words?
- Words in the top 1/3 by score become candidate key phrases, and candidates adjacent to each other in the source text are combined (a sketch follows this list).
- This could use some more work…
- For example, "Lean Startup" gets split into "Lean" and "Startup" because the interpunct "・" between them is a symbol, but it seems these could be joined regardless of the interpunct's score.
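A sketch of that combining step (my illustration; the scores below are made up):

```python
# Take the top 1/3 of words by score as candidates, then merge runs of
# candidates that are adjacent in the source text into key phrases.
def merge_candidates(words, scores):
    ranked = sorted(set(words), key=lambda w: scores.get(w, 0.0), reverse=True)
    candidates = set(ranked[: max(1, len(ranked) // 3)])
    phrases, current = set(), []
    for w in words:
        if w in candidates:
            current.append(w)
        else:
            if current:
                phrases.add(" ".join(current))
            current = []
    if current:
        phrases.add(" ".join(current))
    return phrases

words = "the lean startup method helps lean teams".split()
scores = {"lean": 0.9, "startup": 0.8, "method": 0.3, "the": 0.2,
          "helps": 0.1, "teams": 0.1}
print(merge_candidates(words, scores))  # {'lean startup', 'lean'}
```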
TextRank: good results if limited to nouns?
- For, thing, like, where, it, if, …
- Frequent nouns get extracted as key phrases.
- Should they go on a blacklist?
- I mean, that's IDF…
TextRank and TF-IDF
- After all, the great thing about TextRank is that it requires nothing beyond the document being processed.
- It doesn't even need IDF information.
- But then you use language-specific knowledge to extract only nouns…
- In a kintone- or Scrapbox-like use case, using information about high-frequency words is not a real problem.
- I'd rather use "information from other texts", and especially with Scrapbox, "information from key phrases that humans have already added".
- I want to improve incrementally.
Approaches using TF-IDF
- A method where a phrase's score is the sum of the TF-IDF of each word that makes up the phrase (a sketch follows this list)
- Use only the longest noun phrases as candidate key phrases.
- Phrase-based TF-IDF: Application of Noun Phrase Analysis (2013) employs a mechanism that assigns scores not only to the longest noun phrases but also to partial noun phrases
- Better on longer documents averaging 8,000 words, but worse on shorter ones averaging 134 words
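A sketch of the "sum of the member words' TF-IDF" scoring (my illustration; the documents are toy data, and noun-phrase candidates are assumed to come from elsewhere):

```python
# Score a candidate phrase as the sum of its constituent words' TF-IDF.
import math
from collections import Counter

def word_tfidf(doc_words, all_docs):
    """TF-IDF of each word in one document; all_docs is a list of word sets."""
    tf = Counter(doc_words)
    n = len(all_docs)
    return {w: tf[w] * math.log(n / sum(1 for d in all_docs if w in d))
            for w in tf}

def phrase_score(phrase, tfidf):
    return sum(tfidf.get(w, 0.0) for w in phrase)

docs = [{"noun", "phrase", "analysis"}, {"phrase", "score"}, {"noun", "score"}]
doc0 = ["noun", "phrase", "analysis", "noun", "phrase", "analysis"]
tfidf = word_tfidf(doc0, docs)
print(phrase_score(["noun", "phrase"], tfidf))     # rarer words add more
print(phrase_score(["phrase", "analysis"], tfidf))
```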
The use case of Scrapbox links:
- Joining a link that already has many participants is not very beneficial
- Links to faraway places are good
- Even ones that don't connect yet are good, because they leave room for future connections.
in other words
- If N is the number of pages participating in a link, should its score contribution be 1/N? (a sketch follows this list)
- This is almost the same idea as IDF, except that instead of word document frequency it uses phrase (link) frequency
- Defining "far"
- On Scrapbox, pages up to 2 hops away along existing links are already displayed as related
- It is preferable to link to pages 3 or more hops away that are not shown there
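A tiny sketch of that 1/N scoring (the data structure and the treatment of unseen links are my assumptions):

```python
# Score contribution of a candidate phrase: 1/N, where N is the number of
# pages already participating in that link; unseen links score fully,
# since they leave room for future connections.
def link_score(phrase, link_pages):
    n = len(link_pages.get(phrase, ()))
    return 1.0 / n if n else 1.0

link_pages = {"KJ method": ["p1", "p2", "p3", "p4"],
              "maximal substring": ["p1"]}
print(link_score("KJ method", link_pages))          # 0.25, well connected
print(link_score("maximal substring", link_pages))  # 1.0
print(link_score("brand-new phrase", link_pages))   # 1.0, future connection
```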
- In the case of Scrapbox, the distance between pages can be determined from the link connectivity (a BFS sketch follows)
- This is because links between pieces of content have been assigned manually in advance.
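A sketch of measuring that distance by breadth-first search over a hypothetical page-to-linked-pages adjacency dict:

```python
# BFS over the link graph: pages 3+ hops away are candidates for new links,
# since Scrapbox already shows everything up to 2 hops.
from collections import deque

def hops(graph, start):
    """Hop distance from `start` to every reachable page."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, ()):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist

graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
d = hops(graph, "A")
print([p for p, k in d.items() if k >= 3])  # ['D'] -- worth a new link from A
```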
- Isn't that unavailable in use cases such as kintone?
- Not really.
- Each record has a different distance depending on whether it is in the "same app" or not.
- For data with a hierarchical structure, the hierarchical inclusion relations can serve as links for determining distance.
- Consideration: auto-bracketing.
What was cut
- Even in Scrapbox, "other projects" and "newly added content" are at distance ∞.
- Wouldn't a system that can discover links between these be preferable?
This page is auto-translated from /nishio/キーフレーズ抽出2019-04-02 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.