- Scaffolding Network 2022-05-09 (pScaffoldNetwork): MeCab Constrained Parsing
- Struggling with this.
- Out-of-range access occurs.
- Trying to read beyond the end of the suffix sequence
- The cause is again unknown.
- For now, with a workaround that keeps the program from dying, it runs all the way to the end.
- However, the notation-variant links look strange.
- The opening words are indeed common.
- The ending word should also be common, but it’s not.
- A quick look at the output suggests the text is not being left corrupted and unrestored, so this is progress.
- Possible common cause of the “out-of-range access” and the “strange notation-variant links”
- The end seems to be off by one.
- That would cause both the out-of-range access and the wrong endings of the notation variants.
- It does not seem to be simply that the last word is missing; the next word is not equal.
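The off-by-one suspicion can be illustrated with a minimal sketch. The real cause is unknown, so this is only an assumption about what such a bug could look like: `common_prefix_len` and its `end` parameter are hypothetical names, not taken from pScaffoldNetwork. Comparing two suffixes of a token sequence needs an exclusive end; if the end is one too large, the comparison both reads past the sequence and can report a wrong ending for the match.

```python
def common_prefix_len(tokens, i, j, end=None):
    """Length of the common prefix of the suffixes tokens[i:end] and tokens[j:end].

    `end` is exclusive. Passing len(tokens) + 1 (an off-by-one end) makes the
    loop index past the last token.
    """
    if end is None:
        end = len(tokens)
    n = 0
    # Check both bounds before touching the tokens.
    while i + n < end and j + n < end and tokens[i + n] == tokens[j + n]:
        n += 1
    return n


tokens = ["a", "b", "c", "x", "a", "b", "c"]
print(common_prefix_len(tokens, 0, 4))  # correct end: stops at the last token

try:
    common_prefix_len(tokens, 0, 4, end=len(tokens) + 1)  # end off by one
except IndexError as exc:
    print("out-of-range access:", exc)
```

If the actual bug has this shape, fixing the end index should address both symptoms at once, which is consistent with the common-cause guess above.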
For data with large amounts of duplicated text
- For example, several versions of the same text exist after repeated polishing.
- A long common string spans multiple documents, so it gets treated as a keyword.
- Solutions
- A large number of common string links will be found between specific sets of documents
- A: Detect it and reduce the score
- B: After extracting one keyword, skip any later candidate whose “set of pages to be connected” is covered by it, so that it is not extracted.
- This should filter out the very long keywords that occur only between cloned texts, because they do not appear in other documents.
- C: Change the score calculation formula so that this kind of thing naturally gets a lower score
- To begin with:
- Keyword scoring currently does not distinguish exact-match substrings from ambiguous matches, and there is a proposal to change that.
- Because the two are not distinguished, an oddly notated variant keyword is usually long, so it gets a higher score.
- It is not right to merely behave better once things have gone wrong; things should not go wrong in the first place.
- Even with that change, the same long string matches will still occur in “data that contains repeatedly edited and mis-versioned sentences”.
- Test the “MeCab Constrained Analysis” part on another data set that should contain no duplicates.
- This part is still not working correctly.
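Option B above can be sketched as follows. This is a minimal illustration under assumptions: the candidate format `(keyword, score, pages)` and the function name `filter_covered` are hypothetical, and the sample data is made up. Candidates are visited in descending score order, and a candidate is skipped when its set of connected pages is a subset of the page set of a keyword that was already kept.

```python
def filter_covered(candidates):
    """Keep keywords whose page set is not covered by an already-kept keyword.

    candidates: iterable of (keyword, score, pages), where pages is a set of
    page names the keyword would connect.
    """
    kept = []
    for keyword, score, pages in sorted(candidates, key=lambda c: c[1], reverse=True):
        pages = frozenset(pages)
        # `<=` on sets is the subset test: covered means nothing new is connected.
        if any(pages <= kept_pages for _, _, kept_pages in kept):
            continue
        kept.append((keyword, score, pages))
    return kept


candidates = [
    ("long cloned passage", 9.0, {"draft_v1", "draft_v2"}),
    ("another cloned passage", 8.0, {"draft_v1", "draft_v2"}),
    ("scaffold", 5.0, {"draft_v1", "notes"}),
]
for keyword, score, pages in filter_covered(candidates):
    print(keyword, score, sorted(pages))
```

With this rule, many long strings shared only between clones of one text collapse into a single link, while a keyword that also reaches a different page survives.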
Which links to choose?
- Proposal: simply take the top N by score on each page.
- Proposal: exclude keywords that link the same set of pages as a higher-ranked keyword.
- Proposal: keep a set of “already linked pages” and skip a link if its pages are already included.
- Similar to Scrapbox behavior
- ✅ Exclude overlap on strings with top-level keywords
- Add links one at a time, starting with the highest scoring link.
- Spanning Tree Idea
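The last two ideas (adding links one at a time from the highest score, and the spanning tree) can be combined in a Kruskal-style sketch. This is an assumption about how they might fit together, not the implementation used here: links are simplified to hypothetical `(score, page_a, page_b)` tuples, whereas a real keyword may connect more than two pages. A link is added only if it joins two pages that are not yet connected, which yields a maximum-score spanning forest.

```python
def select_links(links):
    """Greedily pick links in descending score order, skipping redundant ones.

    links: iterable of (score, page_a, page_b).
    Returns the chosen links; they form a spanning forest over the pages.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for score, a, b in sorted(links, reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # both pages are already connected through earlier links
        parent[root_a] = root_b
        chosen.append((score, a, b))
    return chosen


links = [(0.9, "A", "B"), (0.8, "B", "C"), (0.7, "A", "C"), (0.5, "C", "D")]
for link in select_links(links):
    print(link)
```

A strict spanning tree keeps the link count minimal. Allowing a few extra links per page, as in the top N proposal, would be a looser variant of the same loop.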
Experimental data proposals
- Import from a Kitaro Nishida PDF and publish it
- Introduction to Philosophy
- Import one of Drucker’s books
- This would be private.
- It may not work because of OCR
- Conversion from Scrapbox
- Generate conversion candidates in a dry run, check them, and then convert based on them.
- Write the candidates to a file so they can be edited?
- Proposal to crawl through a collection of online columns, etc.
This page is auto-translated from [/nishio/pScaffoldNetwork 2022-05-09](https://scrapbox.io/nishio/pScaffoldNetwork 2022-05-09) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.