- One approach estimates, for each word, whether to erase it and whether to separate it from the word to its right (a sketch of this follows at the end of this block).
- Another approach uses BERT or something similar to generate a sequence of delimited tokens.
I was leaning toward the former, because I don't want it to be too heavy to run on a server, but that approach falls apart the moment I try to support rewriting.
Should we try the latter rather than avoiding it just because "it might be heavy"?
What should the teacher data (training data) look like?
- So far, we've only sporadically recorded examples of bad splits and splits made by humans.
- If we want to feed it into BERT, maybe a sequence of delimited tokens.
- We thought it would be difficult to express rewriting in the current rule-based version, but string substitution can be done in preprocessing.
- Then after that, it can simply be chopped into tokens as usual.
- Worst case, only the rewriting part can be left rule-based.
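A minimal sketch of what the former (per-word) approach could look like, assuming the model only has to output two flags per token. The classifier itself is omitted and the flags below are hand-written purely to illustrate the data flow; nothing here is the actual implementation.

```python
# Per-token formulation: each token gets two flags,
# "erase" (drop it) and "split_right" (end the current sticky after it).
from dataclasses import dataclass
from typing import List


@dataclass
class TokenDecision:
    surface: str        # the token text
    erase: bool         # True -> drop this token from the output
    split_right: bool   # True -> start a new sticky after this token


def apply_decisions(decisions: List[TokenDecision]) -> List[str]:
    """Turn per-token decisions into a list of stickies."""
    stickies: List[str] = []
    current: List[str] = []
    for d in decisions:
        if not d.erase:
            current.append(d.surface)
        if d.split_right and current:
            stickies.append("".join(current))
            current = []
    if current:
        stickies.append("".join(current))
    return stickies


# Hand-written example: erase a connective and split into two stickies.
decisions = [
    TokenDecision("長い文を", erase=False, split_right=True),
    TokenDecision("そして", erase=True, split_right=False),
    TokenDecision("付箋に分ける", erase=False, split_right=True),
]
print(apply_decisions(decisions))  # ['長い文を', '付箋に分ける']
```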
2021/3/29
- Dump the split results from the current rule-based version to a file.
- Put an x on a bad split result and an o on a good one (sketched after this list).
- You don't have to mark every one.
- The unmarked ones can be treated as o, meaning "acceptable enough".
- At least it's better than the current state, where failure cases are just unformatted memos.
- Hopefully the bad splits will decrease as the rules are modified.
- I want to merge the outputs.
- Machine learning?
- I think it could be within the scope of parameter adjustment.
- Even if we feed it to BERT in the future, I'd like to wait until a few hundred data points have been collected.
- If the data is kept in this form, the entries that don't contain "bad examples" can be extracted and used as teacher data.
- I checked the code.
- There is JSON output for regression testing, to prevent unexpected breakage.
- It holds a list of "split results."
- Since it's a list, there's no way to mark "this result is bad."
- Regression testing itself will continue to be useful.
- A separate ○/× (good/bad) dictionary could be kept alongside it.
- Right now it's rule-based and I split in descending order of score, but maybe anything with a score of 0.5 or higher should definitely be split, and anything that is still too long afterwards should be chopped up separately (also sketched after this list)?
- Assistance in dividing a long document into stickies: an example of a bad division
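As a rough illustration of the o/x review flow above, here is a minimal sketch assuming a hypothetical tab-separated review file: each line is a mark ("o", "x", or empty) followed by one split result, unmarked lines count as o, and only lines not marked x are kept as teacher data. The file layout and function names are assumptions, not the actual format.

```python
# Hypothetical review-file layout: "<mark>\t<split result>" per line,
# where mark is "o", "x", or empty. Empty means "acceptable enough" (o).
from typing import List, Tuple


def parse_review_file(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (mark, split_result) pairs, defaulting missing marks to 'o'."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if "\t" in line:
            mark, result = line.split("\t", 1)
        else:
            mark, result = "", line
        rows.append((mark if mark in ("o", "x") else "o", result))
    return rows


def teacher_data(rows: List[Tuple[str, str]]) -> List[str]:
    """Keep everything that was not explicitly marked as a bad split."""
    return [result for mark, result in rows if mark != "x"]


lines = [
    "o\t長い文を / 付箋に分ける",
    "x\t長い / 文を付箋に分ける",   # bad split, excluded from teacher data
    "それなりに許容できる分割",       # unmarked -> treated as o
]
print(teacher_data(parse_review_file(lines)))
```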
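And a sketch of the splitting policy floated in the last bullet about the code: split at every candidate scoring 0.5 or higher, then chop anything still too long at its best remaining candidate. The scoring function, the length limit, and the way the 0.5 threshold is applied are all assumptions for illustration, not the actual rule-based implementation.

```python
from typing import Callable, List, Tuple

MAX_LEN = 20      # assumed "still too long" limit (characters)
THRESHOLD = 0.5   # always split at candidates scoring at least this


def split_text(text: str,
               candidates: Callable[[str], List[Tuple[int, float]]]) -> List[str]:
    """candidates(text) returns (cut position, score) pairs."""
    # 1) definite cuts: every candidate with score >= THRESHOLD
    cuts = sorted(pos for pos, score in candidates(text) if score >= THRESHOLD)
    pieces, start = [], 0
    for pos in cuts:
        pieces.append(text[start:pos])
        start = pos
    pieces.append(text[start:])
    pieces = [p for p in pieces if p]  # drop empty fragments

    # 2) anything still too long is chopped once at its best remaining candidate
    result = []
    for piece in pieces:
        if len(piece) > MAX_LEN:
            cands = [c for c in candidates(piece) if 0 < c[0] < len(piece)]
            if cands:
                best = max(cands, key=lambda c: c[1])[0]
                result.extend([piece[:best], piece[best:]])
                continue
        result.append(piece)
    return result


# Toy candidate generator: propose a cut after every "、" with score 0.6.
def toy_candidates(text: str) -> List[Tuple[int, float]]:
    return [(i + 1, 0.6) for i, ch in enumerate(text) if ch == "、"]

print(split_text("長い文は、付箋に、分割したい、という話", toy_candidates))
# ['長い文は、', '付箋に、', '分割したい、', 'という話']
```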
2021-03-30
- I get the feeling that when I have to split at one of several case particles, which one feels most natural isn't determined by the type of the case particle alone.
- I wonder if it depends on the type of the neighboring case particles and the distance to them… (a rough feature sketch follows)
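A hedged sketch of the kind of features the two bullets above suggest: for a candidate cut at a case particle, record not just the particle itself but also the nearest case particles on either side and the distance to them. The particle list and feature names are assumptions, and the tokenization is simplified to one token per word.

```python
from typing import Dict, List

CASE_PARTICLES = {"が", "を", "に", "で", "と", "へ", "から", "まで", "より"}


def particle_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Features for a candidate split right after tokens[i] (a case particle)."""
    # nearest case particles to the left and right, with their token distance
    left = [(i - j, tokens[j]) for j in range(i - 1, -1, -1)
            if tokens[j] in CASE_PARTICLES]
    right = [(j - i, tokens[j]) for j in range(i + 1, len(tokens))
             if tokens[j] in CASE_PARTICLES]
    return {
        "particle": tokens[i],
        "prev_particle": left[0][1] if left else None,
        "prev_distance": left[0][0] if left else None,
        "next_particle": right[0][1] if right else None,
        "next_distance": right[0][0] if right else None,
    }


tokens = ["長い", "文", "を", "付箋", "に", "分割", "する"]
print(particle_features(tokens, 2))
# {'particle': 'を', 'prev_particle': None, 'prev_distance': None,
#  'next_particle': 'に', 'next_distance': 2}
```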
This page is auto-translated from /nishio/機械学習で長文付箋分割 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.