-
We have a lot of language data, and we want computers to help us make use of it.
-
The well-known existing methods are search and recommendation, and the proposed method is a good combination of the two.
-
Search is a system in which a human enters a short "search keyword" and the system points to documents containing it.
- Humans must come up with the "search keywords".
-
The proposed system accepts long text as input. Parts of this long input become the "search keywords".
- If we tried to force an existing search system to do this, we would have to chop the long input into small pieces and search over and over again, which is time-consuming. I made some technical innovations to make it work in a realistic amount of time. For example, searching my Scrapbox, which has 6500 pages, with a 10,000-character input takes less than one second.
- There is no need to create keywords just for searching. You can use it in ways such as "enter a sentence you are about to write and find related articles."
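To make the idea of "parts of a long input become search keywords" concrete, here is a minimal sketch. It is not the technique I actually use (that is not described on this page); it assumes a plain character n-gram inverted index, and the names `build_index`, `suggest`, and the window length `N` are hypothetical choices for illustration only.

```python
from collections import defaultdict

N = 4  # n-gram length in characters; an arbitrary choice for this sketch


def build_index(pages):
    """Map every character n-gram to the set of page titles containing it."""
    index = defaultdict(set)
    for title, text in pages.items():
        for i in range(len(text) - N + 1):
            index[text[i:i + N]].add(title)
    return index


def suggest(index, query):
    """For each page, count how many distinct n-grams of the query it shares."""
    hits = defaultdict(int)
    seen = set()
    for i in range(len(query) - N + 1):
        gram = query[i:i + N]
        if gram in seen:
            continue  # count each distinct n-gram of the query only once
        seen.add(gram)
        for title in index.get(gram, ()):
            hits[title] += 1
    # Pages sharing the most n-grams come first; ties are broken by title.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data standing in for a pile of pages (hypothetical content).
pages = {
    "search": "a search system needs search keywords from a human",
    "cooking": "boil the pasta and add salt",
}
index = build_index(pages)
results = suggest(index, "I do not want to think of search keywords while writing")
```

The index is built once, so each query only costs one dictionary lookup per window of the input, instead of one full search per chopped piece.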
-
I actually put this paragraph into the proposed system: sample1.
-
Relationship to Recommendations.
- Recommendation is, for example, the display of other articles as "related articles" at the end of an article.
- That is a system that takes a "long article" as input and points to "other articles that might be relevant".
- However, existing systems indicate "which articles are related" but rarely indicate "how they are related". Therefore, humans cannot tell what exactly the relationship is to the articles that the system points to as "relevant".
-
The proposed system can show "which phrases are shared". It can say, "This phrase in this document also appears in that text, so I think they are related."
-
Existing recommendation systems can do something similar. A word-frequency-based method could display "this word is shared".
-
However, when what is shown is a word rather than a phrase, humans often fail to understand the meaning. For example, "General Manager's Association" would be chopped into "Hon / Department Manager / Association". Even if the system displays "The word 'department manager' is shared with this article," too much information has been stripped away for it to be understood.
-
The proposed system presents common parts in phrases (sequences of multiple words) rather than words.
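As a rough illustration of presenting shared phrases rather than single words, the sketch below aligns two texts as word sequences with Python's standard `difflib.SequenceMatcher` and keeps only matching runs of two or more words. This is an assumption made for illustration, not the proposed system's algorithm; `common_phrases` and `min_words` are hypothetical names. Note that `get_matching_blocks` returns an ordered alignment, so it can miss shared phrases that appear in a different order in the two texts.

```python
from difflib import SequenceMatcher


def common_phrases(text_a, text_b, min_words=2):
    """Return runs of at least `min_words` consecutive words shared by both texts."""
    a = text_a.lower().split()
    b = text_b.lower().split()
    # autojunk=False turns off the heuristic that ignores very frequent items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words  # also drops the zero-length sentinel block
    ]


# Hypothetical example texts.
draft = "we discussed the general manager association meeting yesterday"
page = "minutes of the general manager association are stored here"
phrases = common_phrases(draft, page)
```

Because the shared part is reported as a whole run of words, the reader sees the phrase in context instead of an isolated word.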
-
The existing system whose output is most similar to this system's is Scrapbox's linked-page display, in which a human explicitly encloses a phrase in brackets and the pages in which that phrase also appears are displayed at the end of the page as "linked".
-
The Scrapbox method requires a human to have bracketed phrases in a large number of documents in advance. The proposed system generates Scrapbox-like links for a pile of unmarked documents.
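One way to picture "Scrapbox-like links for documents nobody has bracketed" is a table from phrase to the pages that contain it, keeping only phrases found on two or more pages. The sketch below does this with fixed-length word n-grams; that choice, and the names `shared_phrases` and `n`, are assumptions for illustration and not how the proposed system is implemented.

```python
from collections import defaultdict


def shared_phrases(pages, n=2):
    """Return {phrase: sorted page titles} for word n-grams found on 2+ pages."""
    where = defaultdict(set)
    for title, text in pages.items():
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            where[" ".join(words[i:i + n])].add(title)
    # A phrase acts like a bracketed link only if it connects at least two pages.
    return {phrase: sorted(titles) for phrase, titles in where.items() if len(titles) >= 2}


# Hypothetical unbracketed pages.
pages = {
    "A": "notes on information sharing in teams",
    "B": "information sharing needs trust",
    "C": "recipes for pasta",
}
links = shared_phrases(pages)
```

Here the phrase plays the role that a hand-written bracket plays in Scrapbox: it is the named reason two pages are connected.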
-
Detailed appeal points
- If the input is about as long as one tweet, results come back in milliseconds. Blazingly fast.
- If integrated with a document-writing tool, searching requires no active thought: recommendations can be made from the content of the document while it is being written. Without having to switch your mental mode to "think of search keywords" while writing, you can find shared phrases in a huge stock of documents.
- Fuzzy search. The search ignores verb conjugations, the presence or absence of particles, and the capitalization of English letters, thus absorbing notational variation. For example, "information sharing" and "information is shared" hit each other, as do spellings of "information sharing" that differ only in particles.
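The notation-absorbing search in the last bullet can be sketched as a normalization step applied to both the query and the documents before comparison. The original targets Japanese conjugations and particles; the sketch below is only an English stand-in, where a small hand-picked set of function words plays the role of particles and crude suffix stripping plays the role of conjugation handling. `FUNCTION_WORDS`, `SUFFIXES`, and `normalize` are hypothetical names, not part of the proposed system.

```python
import re

# Stand-in for particles: a tiny, hand-picked list (an assumption of this sketch).
FUNCTION_WORDS = {"is", "are", "the", "of", "a", "an"}
# Stand-in for conjugation handling: crude suffix stripping, not a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def normalize(phrase):
    """Reduce a phrase to a comparison key that ignores case, function words, and endings."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    key = []
    for tok in tokens:
        if tok in FUNCTION_WORDS:
            continue
        for suf in SUFFIXES:
            # Keep at least three characters so short words are not destroyed.
            if tok.endswith(suf) and len(tok) - len(suf) >= 3:
                tok = tok[:-len(suf)]
                break
        key.append(tok)
    return tuple(key)


same = normalize("information sharing") == normalize("Information is shared")
```

Two phrases are treated as a hit when their keys are equal, so the index only ever needs to store normalized keys.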
----- old
A system that performs a fuzzy search of all 6500 pages in my Scrapbox on each edit and presents pages that share key phrases with the text you entered.
Fuzzy search
- The major difference from the common "related article recommendation" is that the "related" comes with a human-understandable label. This fits well with Scrapbox's philosophy. For example, if the relation had no name, a page shown only as "related" to "probability integral" would leave the reader asking "Why?". With this method, "dx is shared" is displayed, so we can understand "ah, that dx is the symbol for integration."
This page is auto-translated from /nishio/ScrapboxăźăȘăłăŻăă”ăžă§ăčă using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. Iâm very happy to spread my thought to non-Japanese readers.