
  • One approach estimates, for each word, whether to erase it and whether to separate it from the word to its right.
  • Another approach generates a column of delimited tokens with BERT or something similar.
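The former (per-word) approach can be sketched as two binary decisions per token: erase it, and split after it. A minimal illustration, where `decide()` is a hypothetical placeholder standing in for the real model or rules:

```python
# Sketch of the per-word approach: for each token, decide two things,
# "erase this token?" and "insert a split boundary after it?".
# decide() below is a hypothetical placeholder, not the actual model.

def decide(token):
    # Placeholder heuristics; a trained classifier would go here.
    erase = token in {"、", ","}            # e.g. drop stray commas
    split_after = token in {"が", "ので"}   # e.g. split after certain particles
    return erase, split_after

def split_text(tokens):
    pieces, current = [], []
    for tok in tokens:
        erase, split_after = decide(tok)
        if not erase:
            current.append(tok)
        if split_after and current:
            pieces.append("".join(current))
            current = []
    if current:
        pieces.append("".join(current))
    return pieces

pieces = split_text(["長い", "ので", "分割", "する"])
```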

I was thinking of the former, because I don't want it to be too heavy to run on a server, but that approach falls apart the moment I try to support rewriting.

Shouldn't we try the latter instead of avoiding it on the grounds that "it might be heavy"?

What should happen to teacher data?

  • So far, we've only sporadically documented examples of bad splits and human-made splits.
  • If we want to feed it into BERT, a column of delimited tokens would probably be the format.
  • We thought rewriting would be hard to express in the current rule-based version, but we can do string substitution in preprocessing.
  • After that, it gets chopped into tokens as usual.
  • Worst case, only the rewriting stays rule-based.
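The "string substitution in preprocessing" idea could be as simple as a table of regex rewrites applied before the splitter runs. A minimal sketch; the rules below are hypothetical examples, not the actual rewrite table:

```python
import re

# Hypothetical rewrite rules applied before tokenization/splitting.
# Each pair is (pattern, replacement); real rules would come from the memos.
REWRITE_RULES = [
    (r"である。", "だ。"),   # normalize sentence endings (example rule)
    (r"\s+", " "),           # collapse runs of whitespace
]

def preprocess(text):
    for pattern, repl in REWRITE_RULES:
        text = re.sub(pattern, repl, text)
    return text
```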

2021/3/29

  • Spit out the results of the split with the current rule-based one to a file.

  • Put an x on a bad split result and an o on a good split result

    • You don't have to mark everything.
    • Unmarked ones can be treated as o, i.e., acceptable enough.
    • At least it's better than the current state of "failure cases are unformatted memos".
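The o/x markings above lend themselves to a simple line format: an optional marker column followed by the split result. A minimal parser sketch (the exact file format is an assumption, not a fixed spec); per the note above, unmarked lines default to o:

```python
def load_annotations(lines):
    """Parse annotated split results.

    Each line is optionally prefixed with "o " (good) or "x " (bad);
    unmarked lines are treated as so-so acceptable ("o").
    """
    good, bad = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("x "):
            bad.append(line[2:])
        elif line.startswith("o "):
            good.append(line[2:])
        else:
            good.append(line)  # nothing marked: treat as acceptable
    return good, bad

good, bad = load_annotations(["o 良い分割", "x 悪い分割", "印なし"])
```

Keeping the bad examples in the same file (rather than deleting them) means they can later be filtered out when building teacher data.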

  • Hopefully bad splits will decrease as the rules are modified.

  • I want to merge outputs.

  • Machine learning?

    • I think it could be within the scope of parameter adjustment.
  • Even if we feed it to BERT in the future, I'd like to wait until a few hundred data points have been collected.

  • If we keep the data in this form, we can filter out the "bad examples" and use the rest as teacher data.

  • I checked the code.

    • JSON output for regression testing to prevent unexpected breakage.
    • It has a list of "split results".
    • Since it's just a list, there's no way to mark "this result is bad".
    • Regression testing itself will continue to be useful
    • You can have a separate ✅ dictionary.
    • Right now it's rule-based and I split at the highest-scoring boundaries in order; maybe I should always split when the score is 0.5 or higher, and then separately chop up anything that's still too long?
  • Assistance in dividing a long document into stickies: an example of a bad division
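The threshold idea a few bullets up (always split when the score is 0.5 or higher, then re-chop pieces that are still too long) might look like the sketch below; `score()` is a stand-in for the rule-based scorer, and `max_len` is a hypothetical length limit:

```python
def split_by_threshold(tokens, score, threshold=0.5, max_len=8):
    """Split at every boundary scoring >= threshold, then force-split
    any piece still longer than max_len at its highest-scoring boundary.
    score(tokens, i) rates a split after position i."""
    # First pass: split wherever the score clears the threshold.
    pieces, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if i < len(tokens) - 1 and score(tokens, i) >= threshold:
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)

    # Second pass: pieces that are still too long get chopped at their
    # best-scoring internal boundary, regardless of the threshold.
    result, stack = [], list(reversed(pieces))
    while stack:
        piece = stack.pop()
        if len(piece) <= max_len:
            result.append(piece)
        else:
            best = max(range(len(piece) - 1), key=lambda j: score(piece, j))
            stack.append(piece[best + 1:])  # right half, processed later
            stack.append(piece[:best + 1])  # left half, processed next
    return result

def demo_score(toks, i):
    # Stand-in scorer: high score after the conjunction "ので".
    return 0.9 if toks[i] == "ので" else 0.1

result = split_by_threshold(["長い", "ので", "分割", "する"], demo_score)
```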

2021-03-30

  • I get the feeling that when I have to split at one of several case particles, the type of case particle alone doesn't determine which one feels natural.
    • I wonder if it depends on the type of the neighboring case particle and the distance to it…
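That hunch (neighboring particle and distance) could be explored by extracting simple features per candidate boundary. A hypothetical feature extractor, assuming the input is already tokenized; the particle set and feature names are my own illustration:

```python
# Hypothetical features for a candidate split at a case particle:
# the particle itself, the next case particle, and the distance between them.
CASE_PARTICLES = {"が", "を", "に", "で", "と", "から", "まで", "より", "へ"}

def particle_features(tokens, i):
    feats = {"particle": tokens[i], "next_particle": None, "distance": None}
    for j in range(i + 1, len(tokens)):
        if tokens[j] in CASE_PARTICLES:
            feats["next_particle"] = tokens[j]
            feats["distance"] = j - i
            break
    return feats

feats = particle_features(["犬", "が", "走る", "猫", "を", "見る"], 1)
```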

This page is auto-translated from /nishio/機械学習で長文付箋分割 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.