- There's an approach that estimates, for each word, whether or not to erase it and whether or not to split it off from the word to its right (sketched right below).
- There's an approach that generates a sequence of delimited tokens with BERT or something like that.
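Roughly what the data for the first approach could look like, as a minimal sketch; the names here (`TokenLabel`, `apply_labels`) are made up for illustration, not anything in the existing code.

```python
# Hypothetical sketch of the per-word labeling scheme (the first approach).
# Each token carries two binary decisions: erase it? split after it?
from dataclasses import dataclass
from typing import List

@dataclass
class TokenLabel:
    surface: str       # the token text
    erase: bool        # True -> drop this token from the output
    split_after: bool  # True -> start a new sticky after this token

def apply_labels(tokens: List[TokenLabel]) -> List[str]:
    """Turn a labeled token sequence back into split segments."""
    segments: List[str] = []
    current: List[str] = []
    for t in tokens:
        if not t.erase:
            current.append(t.surface)
        if t.split_after and current:
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments
```

Predicting these two per-token labels is a much lighter job than generating text, which is why the former approach looked attractive for running on a server.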
 
I was thinking of the former, because I don't want to make it too heavy to run on a server, but that approach falls apart the moment I try to support rewriting.
Should we try the latter rather than avoiding it just because "it might be heavy"?
What should happen to the training data?
- So far, we've only documented sporadic examples of bad splits and of splits done by hand.
 - If we want to feed it to BERT, we probably need a sequence of delimited tokens.
 - We thought rewriting would be hard to express in the current rule-based version, but we can do the rewriting as string substitution in preprocessing (see the sketch after this list).
 - After that, just tokenize as usual.
 - Worst case, only the rewriting stays rule-based.
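To make "do the rewriting as string substitution in preprocessing, then tokenize as usual" concrete, a small sketch; the rewrite rules below are invented examples, not the project's actual rules.

```python
import re

# Invented example rules; the real rewriting rules would live in the
# existing rule-based code.
REWRITE_RULES = [
    (re.compile(r"\s+"), " "),        # collapse runs of whitespace
    (re.compile(r"（[^）]*）"), ""),    # e.g. drop parenthetical asides
]

def preprocess(text: str) -> str:
    """Apply the rewriting as plain string substitution before tokenizing."""
    for pattern, replacement in REWRITE_RULES:
        text = pattern.sub(replacement, text)
    return text

# After this, the text is tokenized and split as usual, so the learned
# splitter never has to model the rewriting itself.
```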
 
2021-03-29
- Spit out the results of splitting with the current rule-based splitter to a file (a sketch of the annotation round trip follows this block).
 - Put an × on bad split results and an ○ on good ones.
 - You don't have to mark everything.
  - The ones with nothing on them can be counted as ○, i.e. acceptable enough.
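A minimal sketch of that round trip, under an assumed file format (one split result per line, an optional o/x mark in front); the function names are placeholders.

```python
from typing import Dict, List

MARKS = {"o": "o", "○": "o", "x": "x", "×": "x"}

def dump_for_annotation(results: List[str], path: str) -> None:
    """Write one split result per line; a human prefixes o/x marks."""
    with open(path, "w", encoding="utf-8") as f:
        for r in results:
            f.write(r + "\n")

def load_annotations(path: str) -> Dict[str, str]:
    """Read the annotated file back into a {split_result: 'o' or 'x'} dict.
    Unmarked lines default to 'o' ("acceptable enough")."""
    marks: Dict[str, str] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            head, _, rest = line.partition(" ")
            if head in MARKS and rest:
                marks[rest] = MARKS[head]
            else:
                marks[line] = "o"  # nothing on it -> treated as acceptable
    return marks
```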
 
 - At least it's better than the current situation, where failure cases are just memos with no formatting.
 - Hopefully the bad splits will decrease as the rules are modified.
 - I want to merge outputs.
 - Machine learning?
  - I think it could be within the scope of parameter adjustment (a rough sketch follows this block).
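One hypothetical way "parameter adjustment" could use the ○/× marks: grid-search the splitter's numeric parameters (here just a single score threshold) for the best agreement with the human marks. `make_split_fn` and everything else here are placeholders, not the actual code's interface.

```python
from typing import Callable, Dict, List

def agreement(split_fn: Callable[[str], List[str]],
              docs: List[str],
              marks: Dict[str, str]) -> float:
    """Fraction of produced segments that humans marked 'o' (unknown ones ignored)."""
    good = total = 0
    for doc in docs:
        for seg in split_fn(doc):
            if seg in marks:
                total += 1
                good += marks[seg] == "o"
    return good / total if total else 0.0

def tune_threshold(make_split_fn, docs, marks,
                   candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Pick the score threshold whose splits best agree with the o/x marks.
    make_split_fn(threshold) is assumed to return a splitting function."""
    return max(candidates, key=lambda t: agreement(make_split_fn(t), docs, marks))
```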
 
 - Even if we feed BERT in the future, I'd like to wait until a few hundred data points have been collected.
 - If we keep the data in this form, the parts that don't contain "bad examples" can be split off and used as training data.
- I checked the code.
 - There is JSON output for regression testing, to prevent things from breaking unexpectedly.
  - It holds a list of "split results".
  - Because it's a list, you can't mark "this particular result is bad".
  - Regression testing itself will continue to be useful.
  - You could keep a separate ○/× dictionary.
 - Right now it's rule-based and splits in descending order of score; but should it split unconditionally whenever the score is 0.5 or higher, and then chop anything that is still too long separately? (a sketch follows this block)
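A hypothetical sketch of that idea: split unconditionally at every boundary scoring 0.5 or higher, then keep cutting any segment that is still too long at its best-scoring remaining boundary. `score_split_points` stands in for whatever the rule-based code uses to score candidate boundaries, and `MAX_LEN` is an assumed length limit, so this is one possible reading, not the project's actual behavior.

```python
from typing import Callable, List, Tuple

MAX_LEN = 60  # assumed upper bound on a sticky's length, in characters

def split_text(text: str,
               score_split_points: Callable[[str], List[Tuple[int, float]]],
               threshold: float = 0.5) -> List[str]:
    """Split at every boundary scoring >= threshold, then keep chopping
    any segment that is still longer than MAX_LEN."""
    cuts = sorted(i for i, s in score_split_points(text) if s >= threshold)
    segments, prev = [], 0
    for i in cuts + [len(text)]:
        if i > prev:
            segments.append(text[prev:i])
            prev = i

    result: List[str] = []
    for seg in segments:
        result.extend(_chop_if_long(seg, score_split_points))
    return result

def _chop_if_long(seg: str, score_split_points) -> List[str]:
    """Recursively cut an over-long segment at its best-scoring boundary."""
    if len(seg) <= MAX_LEN:
        return [seg]
    candidates = [(s, i) for i, s in score_split_points(seg) if 0 < i < len(seg)]
    if not candidates:
        return [seg]          # nothing left to cut on
    _, cut = max(candidates)  # highest-scoring boundary wins
    return _chop_if_long(seg[:cut], score_split_points) + \
           _chop_if_long(seg[cut:], score_split_points)
```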
 
- Assistance in dividing a long document into stickies: an example of a bad division
 
2021-03-30
- When I have to split at one of several case particles, I get the feeling that which one feels natural isn't determined by the type of case particle alone.
- I wonder if it depends on the type of the neighboring case particle and the distance to it… (a feature sketch follows)
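To make that concrete, a hypothetical feature extraction for candidate split points: for each case particle in a tokenized sentence, record its surface form, the neighboring case particles, and the distance (in tokens) to them. The input format (a list of (surface, part-of-speech) pairs, as a MeCab- or Janome-style tokenizer would give) and all names here are assumptions.

```python
from typing import Dict, List, Tuple

# A token is (surface, part_of_speech); "格助詞" marks a case particle.
Token = Tuple[str, str]

def case_particle_features(tokens: List[Token]) -> List[Dict]:
    """For each case particle, collect features that might decide
    whether splitting right after it feels natural."""
    idxs = [i for i, (_, pos) in enumerate(tokens) if "格助詞" in pos]
    features = []
    for n, i in enumerate(idxs):
        prev_i = idxs[n - 1] if n > 0 else None
        next_i = idxs[n + 1] if n + 1 < len(idxs) else None
        features.append({
            "index": i,
            "particle": tokens[i][0],  # e.g. が, を, に, で
            "prev_particle": tokens[prev_i][0] if prev_i is not None else None,
            "next_particle": tokens[next_i][0] if next_i is not None else None,
            "dist_prev": i - prev_i if prev_i is not None else None,
            "dist_next": next_i - i if next_i is not None else None,
        })
    return features
```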
 
 
This page is auto-translated from /nishio/機械学習で長文付箋分割 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thought to non-Japanese readers.