- Reading notes on run_classifier.py in [Japanese BERT]. I am reading https://github.com/yoheikikuta/bert-japanese/blob/master/src/run_classifier.py
 - The upstream BERT repository is included as a git submodule.
  - It is added to sys.path and then reused via `import modeling` and similar imports.
  - The BERT model definition and related code are written there.
  - Related: in VSCode, the import gets moved before the sys.path setting.
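The sys.path approach can be sketched as follows. This is a minimal illustration, not the repository's actual code: the temporary directory stands in for the submodule checkout, and the stub `modeling.py` and the `import_from_dir` helper are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_dir(directory, module_name):
    """Put `directory` at the front of sys.path, then import `module_name`."""
    directory = str(Path(directory).resolve())
    if directory not in sys.path:
        sys.path.insert(0, directory)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Stand-in for the submodule checkout (hypothetical layout and stub contents).
submodule_dir = Path(tempfile.mkdtemp()) / "bert"
submodule_dir.mkdir()
(submodule_dir / "modeling.py").write_text("NAME = 'stub modeling module'\n")

modeling = import_from_dir(submodule_dir, "modeling")
print(modeling.NAME)
```

The import only succeeds because it runs after sys.path has been changed, which is why a tool that moves the import above the sys.path line breaks the script.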
 
 - This run_classifier.py is based on the original run_classifier.py, with modifications such as using SentencePiece.
 - tf.app.run() calls main.
 - main → model_fn_builder → create_model
 - In create_model:
  - `model = modeling.BertModel(...)`: the BERT model is built here.
  - `modeling` is imported from the original BERT repository.
  - BERT's network structure is defined there.
 
 - Comment in the code:
  - "In the demo, we are doing a simple classification task on the entire segment."
  - "If you want to use the token-level output, use model.get_sequence_output() instead."
 - The code uses get_pooled_output().
  - What is this?
   - A simple fully connected layer that takes the output for the first token as input and produces an output of size hidden_size:
 
 
     # The "pooler" converts the encoded sequence tensor of shape
     # [batch_size, seq_length, hidden_size] to a tensor of shape
     # [batch_size, hidden_size]. This is necessary for segment-level
     # (or segment-pair-level) classification tasks where we need a fixed
     # dimensional representation of the segment.
     with tf.variable_scope("pooler"):
       # We "pool" the model by simply taking the hidden state corresponding
       # to the first token. We assume that this has been pre-trained
       first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
       self.pooled_output = tf.layers.dense(
           first_token_tensor,
           config.hidden_size,
           activation=tf.tanh,
           kernel_initializer=create_initializer(config.initializer_range))
    - My first reaction was "What? The first token? Shouldn't it be the last token of the sentence?", but that reaction is wrong.
        - It comes from dragging along an RNN-like mental model.
        - BERT is a stack of [[self-attention]] layers.
            - Each self-attention layer acts like a convolution of unbounded width over the layer below it.
            - So it does not matter whether a position is at the beginning or the end of the sentence.
    - I also thought the output for the first token would be packed with information about that word itself as well as the whole sentence, though.
        - That is not right either.
        - The first token is not a word, because there is a `[CLS]` token at the beginning.
            - `tokens:   [CLS] the dog is hairy . [SEP]`
            - [src](https://github.com/yoheikikuta/bert-japanese/blob/df61ef5065100c8f8df4f01728020e399b4326da/src/run_classifier.py#L283)
    - There is a discussion of whether the output for the first token can be interpreted as a vector embedding of the sentence.
        - [Features extracted from layer -1 represent sentence embedding for a sentence? · Issue #71 · google-research/bert](https://github.com/google-research/bert/issues/71)
            - Why not use the hidden state of the first token as default strategy, i.e. the `[CLS]`?
                - [Frequently Asked Questions - bert-as-service 1.6.1 documentation](https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-use-the-hidden-state-of-the-first-token-as-default-strategy-i-e-the-cls)
            - Why not the last hidden layer? Why second-to-last?
                - [Frequently Asked Questions - bert-as-service 1.6.1 documentation](https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-the-last-hidden-layer-why-second-to-last)
        - My conclusion was NO: [[BERT sentence vector]].
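What the pooler computes can be sketched without TensorFlow. The following is a pure-Python illustration with made-up toy weights (hidden_size of 2, arbitrary values), not the BERT implementation: it picks position 0 of each sequence and applies a dense layer with tanh. The `pool_first_token` helper is hypothetical.

```python
import math


def pool_first_token(sequence_output, kernel, bias):
    """sequence_output: [batch][seq_length][hidden]; kernel: [hidden][out]; bias: [out]."""
    pooled = []
    for sequence in sequence_output:
        first = sequence[0]  # hidden state at position 0, i.e. the [CLS] position
        row = []
        for j in range(len(bias)):
            z = sum(first[i] * kernel[i][j] for i in range(len(first))) + bias[j]
            row.append(math.tanh(z))
        pooled.append(row)
    return pooled


# Toy input: batch_size=1, seq_length=3, hidden_size=2 (values are arbitrary).
sequence_output = [[[1.0, 0.0], [5.0, 5.0], [-3.0, 2.0]]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, chosen for readability
bias = [0.0, 0.0]

pooled = pool_first_token(sequence_output, kernel, bias)
print(pooled)  # one vector of length hidden_size per batch item
```

In this sketch the later positions never enter the computation directly. In the real model, whatever sentence-level information the pooled vector carries must already have been mixed into position 0 by the self-attention layers underneath.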
 - l.728: `use_one_hot_embeddings=FLAGS.use_tpu)` Is this correct? Yes, it is.
  - It is faster on TPU: [src](https://github.com/yoheikikuta/bert-japanese/blob/59e306faffe8e77dbf7347c8bb75c09ecfa8a1dc/src/extract_features.py#L80)
 
 - tf.flags does not exist in TensorFlow 2.0.
  - `AttributeError: module 'tensorflow' has no attribute 'flags'`
  - It should be replaced by argparse.
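A minimal sketch of an argparse replacement, assuming only a handful of flags. The flag names mirror ones that appear in these notes (`bert_config_file`, `init_checkpoint`, `use_tpu`); the defaults and the `parse_flags` helper are assumptions for illustration, not the script's real flag list.

```python
import argparse


def parse_flags(argv=None):
    """Parse command-line flags into a namespace that is used like FLAGS."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--bert_config_file", required=True)
    parser.add_argument("--init_checkpoint", default=None)
    # One possible mapping for a boolean flag: present means True.
    parser.add_argument("--use_tpu", action="store_true")
    return parser.parse_args(argv)


FLAGS = parse_flags([
    "--bert_config_file=config.json",
    "--init_checkpoint=../model/model.ckpt-1400000",
])
print(FLAGS.bert_config_file, FLAGS.init_checkpoint, FLAGS.use_tpu)
```

With `store_true`, a call such as `--use_tpu=false` is rejected, so existing command lines that pass boolean values explicitly would need adjusting.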
 
 - Porting to TensorFlow 2 would need quite a lot of modifications, which is too much trouble, so I created a TF 1 environment with venv instead:
 
  python3 -m venv ./venv
  source ./venv/bin/activate
  pip install --upgrade pip
  pip install tensorflow==1.15rc2
  pip install -r ../requirements.txt
 - tokenize
     tok = tokenization_sentencepiece.FullTokenizer(model_file="../model/wiki-ja.model", vocab_file="../model/wiki-ja.vocab")
     In [9]: tok.tokenize("本日は晴天なりインターナルサーバーエラー")
     Out[9]: ['▁本', '日', 'は', '晴', '天', 'なり', 'インター', 'ナル', 'サーバー', 'エラー']
 - bert_config_file
  - The config.json for `--bert_config_file=…` is not found in the repository.
  - There is a config.ini instead.
  - run_classifier.py generates the JSON from the ini at its beginning.
  - I decided to save this under the name config.json.
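Generating JSON from an ini file can be sketched with the standard library alone. This is an illustration under assumptions, not the repository's code: the section name `BERT-CONFIG`, the sample keys and values, and both helper functions are hypothetical.

```python
import configparser
import json


def parse_value(text):
    """Parse numbers via JSON; anything that fails to parse stays a string."""
    try:
        return json.loads(text)
    except ValueError:
        return text


def ini_section_to_json(ini_text, section):
    """Convert one ini section into a JSON object string."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return json.dumps({key: parse_value(value) for key, value in parser[section].items()})


# Hypothetical ini content; the real config.ini may use other names and values.
SAMPLE_INI = """
[BERT-CONFIG]
hidden_size = 768
hidden_act = gelu
hidden_dropout_prob = 0.1
"""

config_json = ini_section_to_json(SAMPLE_INI, "BERT-CONFIG")
print(config_json)
```

Writing `config_json` to a file named config.json then gives something that can be passed to `--bert_config_file`.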
 
 - `ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ../model`
  - I placed the downloaded model.ckpt-1400000.* in ../model and passed `--init_checkpoint=../model`, which caused this error; `--init_checkpoint=../model/model.ckpt-1400000` is correct.
 - extract_features.py worked.
  - The command looks like this:
     $ python3 extract_features.py --model_file=../model/wiki-ja.model --vocab_file=../model/wiki-ja.vocab --input_file=smallinput.txt --bert_config_file=config.json --init_checkpoint=../model/model.ckpt-1400000 --output_file=tmp/output
  - JSON is written to the file named by `--output_file`.

import json
x = json.load(open("tmp/output"))
In [8]: x["features"][2]["token"]
Out[8]: 'correct'
In [10]: x["features"][2]["layers"][0]
Out[10]:
{'index': -1,
 'values': [-0.433769, ...]}
    - Each token carries a 768-dimensional vector.
        - (If you carelessly run it on a large file, you may end up with a very large JSON.)
    - Since what I want is a vector for a sentence, can I just take the vector of the first token from layer -1?
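Taking the first token's layer -1 vector out of one output record could look like the sketch below. The record is synthetic and follows the shape shown in the session above (`features` → `token` / `layers` → `index` / `values`); the `first_token_vector` helper is hypothetical, and real vectors have 768 values.

```python
import json


def first_token_vector(record, layer_index=-1):
    """Return (token, values) for the first token at the requested layer index."""
    first = record["features"][0]
    for layer in first["layers"]:
        if layer["index"] == layer_index:
            return first["token"], layer["values"]
    raise KeyError("layer index %d not found" % layer_index)


# Synthetic record with two tokens and two layers each.
line = json.dumps({
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, 0.2]},
                                      {"index": -2, "values": [0.3, 0.4]}]},
        {"token": "dog", "layers": [{"index": -1, "values": [0.5, 0.6]},
                                    {"index": -2, "values": [0.7, 0.8]}]},
    ]
})

token, vector = first_token_vector(json.loads(line))
print(token, vector)
```

This only shows how to pull the vector out; whether that vector works as a sentence embedding is the separate question discussed in the linked issue.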
 - read_examples
  - If a line is separated by `|||`, it is treated as a pair of two sentences; otherwise it is treated as a single sentence.
  - I was thinking that, to feed in my own data, it would be better to replace the part of main that calls read_examples than to read from a text file.
   - This is because I want to use Scrapbox JSON files as the source data.
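The pair-or-single rule, applied to in-memory strings instead of a text file, can be sketched like this. The `Example` class and `examples_from_lines` are hypothetical stand-ins, and the exact separator pattern (a space on each side of `|||`) is an assumption.

```python
import re
from typing import NamedTuple, Optional


class Example(NamedTuple):
    unique_id: int
    text_a: str
    text_b: Optional[str]


def examples_from_lines(lines):
    """Build examples from in-memory strings instead of a text file."""
    examples = []
    for unique_id, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip empty input
        match = re.match(r"^(.*) \|\|\| (.*)$", line)
        if match is None:
            examples.append(Example(unique_id, line, None))  # single sentence
        else:
            examples.append(Example(unique_id, match.group(1), match.group(2)))  # pair
    return examples


examples = examples_from_lines(["the dog is hairy .", "who was it ? ||| it was the dog ."])
for example in examples:
    print(example)
```

A function of this shape accepts any iterable of strings, so the lines could come from a parsed JSON export as easily as from a file.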
 - I imported extract_features.py and overrode main.
  - It took 8245 seconds to vectorize 6343 cases (about 1.3 seconds per case).
 - MacBook Pro (15-inch, 2018) / 2.6 GHz Intel Core i7 / 16 GB 2400 MHz DDR4
 
 - Note on the experiment: Link Creation Support.
 - I haven't tried additional training or fine-tuning with run_classifier.py yet.
  - If I try it, I will write it up in [Japanese BERT fine-tuning].
 
 
This page is auto-translated from /nishio/日本語BERTのrun_classifier読解 using DeepL. If something looks interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.