• A read-through of run_classifier.py from [Japanese BERT]: https://github.com/yoheikikuta/bert-japanese/blob/master/src/run_classifier.py

  • The upstream BERT repository (google-research/bert) is pulled in as a git submodule.

  • This run_classifier.py is based on the original run_classifier.py, modified to use SentencePiece for tokenization, among other changes.

  • tf.app.run() calls main

  • main→model_fn_builder→create_model
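    • For orientation, the entry-point structure looks roughly like the following; this is a minimal sketch of the typical TF1 pattern, not the file's exact code python
# Sketch: tf.app.run() parses the command-line flags and then calls main();
# main() builds an Estimator whose model_fn comes from model_fn_builder(),
# and that model_fn calls create_model() to build the BERT graph.
import tensorflow as tf

def model_fn_builder():
    def model_fn(features, labels, mode, params):
        # in the real file, create_model(...) is called here
        pass
    return model_fn

def main(_):
    model_fn = model_fn_builder()
    # in the real file, an Estimator is built with model_fn and then
    # train()/evaluate()/predict() run depending on the flags
    print("main called; model_fn:", model_fn)

if __name__ == "__main__":
    tf.app.run()  # parses flags, then calls main(argv)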

  • In create_model:

    • model = modeling.BertModel(...)
    • The BERT model is constructed here.
    • Comment
      • In the demo, we are doing a simple classification task on the entire segment.

      • If you want to use the token-level output, use model.get_sequence_output() instead.

      • The code uses model.get_pooled_output()
      • What is this?
        • A simple fully connected layer that takes the output for the first token as input and produces a hidden_size-dimensional output (a sketch of the classification head built on top of it appears at the end of this list) python
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
     with tf.variable_scope("pooler"):
       # We "pool" the model by simply taking the hidden state corresponding
       # to the first token. We assume that this has been pre-trained
       first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
       self.pooled_output = tf.layers.dense(
           first_token_tensor,
           config.hidden_size,
           activation=tf.tanh,
           kernel_initializer=create_initializer(config.initializer_range))
    - My first reaction was, "What? The first token? Shouldn't it be the last token of the sentence?", but that intuition is wrong.
        - That is an RNN-style mental model carrying over.
        - BERT is a stack of [[self-attention]] layers
            - Each self-attention layer acts like a variable-length convolution over the layer below it
            - So it doesn't matter if it's at the beginning or the end of a sentence.
    - I also thought the output of the first token would mix information about that word itself with information about the whole sentence, though.
        - That is not right either.
        - The first token is not a word: a [CLS] token is prepended to the sentence
            - `tokens:   [CLS] the dog is hairy . [SEP]`
            - [src](https://github.com/yoheikikuta/bert-japanese/blob/df61ef5065100c8f8df4f01728020e399b4326da/src/run_classifier.py#L283)
    - Discussion of whether the output for the first token can be interpreted as a vector embedding of the sentence.
        - [Features extracted from layer -1 represent sentence embedding for a sentence? · Issue #71 · google-research/bert](https://github.com/google-research/bert/issues/71)
            - Why not use the hidden state of the first token as default strategy, i.e. the `[CLS]`?
                - [Frequently Asked Questions — bert-as-service 1.6.1 documentation](https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-use-the-hidden-state-of-the-first-token-as-default-strategy-i-e-the-cls)
            - Why not the last hidden layer? Why second-to-last?
                - [Frequently Asked Questions — bert-as-service 1.6.1 documentation](https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-the-last-hidden-layer-why-second-to-last)
        - My conclusion was NO: [[BERT sentence vector]].
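    - For reference, a sketch of the classification head that run_classifier.py-style code builds on top of get_pooled_output(); this mirrors the original google-research/bert create_model with dropout omitted, and assumes model, labels, and num_labels come from the surrounding function python
# Classification head on the pooled [CLS] output (TF1-style sketch).
output_layer = model.get_pooled_output()          # [batch_size, hidden_size]
hidden_size = output_layer.shape[-1].value

output_weights = tf.get_variable(
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
    "output_bias", [num_labels], initializer=tf.zeros_initializer())

logits = tf.nn.bias_add(
    tf.matmul(output_layer, output_weights, transpose_b=True), output_bias)
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)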
  • Environment setup

  python3 -m venv ./venv
  source ./venv/bin/activate
  pip install --upgrade pip
  pip install tensorflow==1.15rc2
  pip install -r ../requirements.txt
  • tokenize

    • tokenization_sentencepiece.FullTokenizer(model_file="../model/wiki-ja.model", vocab_file="../model/wiki-ja.vocab")
    • In [9]: tok.tokenize("本日は晴天なりインターナルサーバーエラー")
    • Out[9]: ['▁本', '日', 'は', '晴', '天', 'なり', 'インター', 'ナル', 'サーバー', 'エラー']
  • bert_config_file

    • The config.json to pass to --bert_config_file= is not found in the repository
    • There is a config.ini
    • run_classifier.py generates the JSON from the ini at the beginning of the script
    • I decided to save that generated config under the name config.json (a sketch of the conversion follows below)
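      • A minimal sketch of that ini-to-json conversion; the section name "BERT-CONFIG" and the ../config.ini path are assumptions, so check the repo's actual config.ini python
# Hypothetical helper: dump the BERT hyperparameters from config.ini into a
# config.json that --bert_config_file can point at.
import configparser
import json

config = configparser.ConfigParser()
config.read("../config.ini")

def _coerce(v):
    # ini values are strings; convert numeric fields where possible
    for cast in (int, float):
        try:
            return cast(v)
        except ValueError:
            pass
    return v

# Assumption: one section of the ini holds the BertConfig fields.
bert_config = {k: _coerce(v) for k, v in config["BERT-CONFIG"].items()}

with open("config.json", "w") as f:
    json.dump(bert_config, f, ensure_ascii=False, indent=2)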
  • ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ../model

    • I placed the downloaded model.ckpt-1400000.* files in ../model and passed --init_checkpoint=../model

    • --init_checkpoint=../model/model.ckpt-1400000 is the correct form: point at the checkpoint prefix, not the directory (see the sketch below)
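    • A quick way to check what TensorFlow sees in each case (sketch, TF1 APIs, paths as above) python
import tensorflow as tf

# Given a directory, TF looks for a 'checkpoint' index file inside it;
# without one, latest_checkpoint returns None and loading raises the ValueError.
print(tf.train.latest_checkpoint("../model"))

# Pointing at the checkpoint prefix skips that lookup entirely.
reader = tf.train.NewCheckpointReader("../model/model.ckpt-1400000")
print(list(reader.get_variable_to_shape_map())[:5])  # a few variable names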
  • extract_features.py worked.

    • The command looks like this
      • $ python3 extract_features.py --model_file=../model/wiki-ja.model --vocab_file=../model/wiki-ja.vocab --input_file=smallinput.txt --bert_config_file=config.json --init_checkpoint=../model/model.ckpt-1400000 --output_file=tmp/output
      • A JSON file is written to the path specified by --output_file. python
x = json.load(open("tmp/output"))
In [8]: x["features"][2]["token"]                                                                                                                      
Out[8]: 'correct'
In [10]: x["features"][2]["layers"][0]                                                                                                                 
Out[10]: 
{'index': -1,
 'values': [-0.433769, ...]}
    - Each token contains a vector of 768 dimensions.
        - (If you carelessly use it on a large file, you may end up with a very large JSON.)
    - Since what I want is a vector for a whole sentence, can I just take the layer -1 vector of the first token? (see the sketch below)
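    - A sketch of pulling that vector out of the extract_features output; note that the script writes one JSON object per input line, so json.load on the whole file only works when there is a single example python
import json
import numpy as np

# Read the first example (each line of the output file is one JSON object).
with open("tmp/output") as f:
    example = json.loads(f.readline())

first_token = example["features"][0]        # the [CLS] token comes first
layer_minus_1 = first_token["layers"][0]    # "index": -1 is the top layer
sentence_vector = np.array(layer_minus_1["values"])
print(first_token["token"], sentence_vector.shape)  # e.g. [CLS] (768,)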
  • read_examples

    • If separated by |||, it is considered as a pair of two sentences, otherwise it is considered as a single sentence
    • If you want to feed in your own data, it seems better to replace the part of main that calls read_examples rather than reading from a text file
      • Because I want to use Scrapbox JSON exports as the source data (a rough sketch follows below)
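      • A rough sketch of such a replacement; it assumes the Scrapbox export format ("pages", each with "lines") and the InputExample(unique_id, text_a, text_b) class from extract_features.py, both of which should be double-checked python
import json

def read_examples_from_scrapbox(json_path):
    # Hypothetical replacement for read_examples(): build examples straight
    # from a Scrapbox export instead of a |||-delimited text file.
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    examples = []
    for unique_id, page in enumerate(data["pages"]):
        text = " ".join(page["lines"])  # join the page body into one "sentence"
        examples.append(
            InputExample(unique_id=unique_id, text_a=text, text_b=None))
    return examples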
  • Imported extract_features.py and overrode main

    • 8,245 seconds to vectorize 6,343 items
    • MacBook Pro (15-inch, 2018) / 2.6 GHz Intel Core i7 / 16 GB 2400 MHz DDR4
  • Note on experiment: Link Creation Support.

  • I have not yet tried further training or fine-tuning with run_classifier.py.

    • If I try it, I will write it up in [Japanese BERT fine-tuning]

This page is auto-translated from /nishio/日本語BERTのrun_classifier読解 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.