-
Reading the run_classifier of [Japanese BERT]. You are reading https://github.com/yoheikikuta/bert-japanese/blob/master/src/run_classifier.py
-
The upstream BERT repository is pulled in as a git submodule.
- It is added to sys.path so that `import modeling` and friends can be reused (see the sketch just after this list).
- The BERT model definition and related code live there.
- Related: imports get moved above the sys.path setup by VSCode.
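- A minimal sketch of that setup, assuming the submodule sits at ../bert relative to src/ (the actual path handling in the repo may differ):
import sys
import os

# assumed location of the upstream BERT submodule relative to this script
sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), os.pardir, "bert"))

import modeling        # BERT network definition from the upstream repo
import optimization    # optimizer / learning-rate schedule from the upstream repo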
-
This run_classifier.py is based on the original run_classifier.py with modifications such as using SentencePiece
-
tf.app.run() calls main
-
main → model_fn_builder → create_model
-
in create_model:
model = modeling.BertModel(...)
- The BERT model is constructed here.
- modeling is imported from the original BERT repository.
- BERT's network structure is defined there.
- Comment in the code:
-
In the demo, we are doing a simple classification task on the entire segment.
-
If you want to use the token-level output, use model.get_sequence_output() instead.
- Using get_pooled_output()
- What is this?
- A simple fully connected layer that takes the output for the first token as input and produces a hidden_size-dimensional output:
-
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token. We assume that this has been pre-trained
first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
self.pooled_output = tf.layers.dense(
first_token_tensor,
config.hidden_size,
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))
- My first reaction was, "What? The first token? Shouldn't it be the last token of the sentence?", but that intuition is wrong.
- I was being dragged along by an RNN-like mental model.
- BERT is a stack of [[self-attention]] layers.
- Each self-attention layer acts like a variable-length convolution over the layer below it.
- So whether the token is at the beginning or the end of the sentence doesn't matter.
- I then thought the output of the first token would mix information about the word itself with information about the whole sentence, though.
- That's not right either.
- The first token is not a word: a [CLS] token is placed at the beginning.
- `tokens: [CLS] the dog is hairy . [SEP]`
- [src](https://github.com/yoheikikuta/bert-japanese/blob/df61ef5065100c8f8df4f01728020e399b4326da/src/run_classifier.py#L283)
- Discussion of whether the output for the first token can be interpreted as a vector embedding of the sentence.
- [Features extracted from layer -1 represent sentence embedding for a sentence? · Issue #71 · google-research/bert](https://github.com/google-research/bert/issues/71)
- Why not use the hidden state of the first token as default strategy, i.e. the `[CLS]`?
- [Frequently Asked Questions - bert-as-service 1.6.1 documentation](https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-use-the-hidden-state-of-the-first-token-as-default-strategy-i-e-the-cls)
- Why not the last hidden layer? Why second-to-last?
- [Frequently Asked Questions - bert-as-service 1.6.1 documentation](https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-the-last-hidden-layer-why-second-to-last)
- My conclusion ended up being NO: [[BERT sentence vector]].
-
l.728
use_one_hot_embeddings=FLAGS.use_tpu)
Is this correct? Yes, it is.
- One-hot embeddings are faster on TPU; [src https://github.com/yoheikikuta/bert-japanese/blob/59e306faffe8e77dbf7347c8bb75c09ecfa8a1dc/src/extract_features.py#L80]
-
tf.flags does not exist in TensorFlow 2.0:
AttributeError: module 'tensorflow' has no attribute 'flags'
- It should be replaced with argparse.
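- A rough sketch of what the argparse replacement could look like (only a few of the script's flags shown; the rest follow the same pattern):
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_file", type=str, required=True,
                    help="SentencePiece model file, e.g. ../model/wiki-ja.model")
parser.add_argument("--vocab_file", type=str, required=True,
                    help="SentencePiece vocab file, e.g. ../model/wiki-ja.vocab")
parser.add_argument("--init_checkpoint", type=str, default=None,
                    help="initial checkpoint, e.g. ../model/model.ckpt-1400000")
parser.add_argument("--use_tpu", action="store_true",
                    help="whether to run on TPU")
FLAGS = parser.parse_args()  # the rest of the code can keep reading FLAGS.xxx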
-
Porting to TensorFlow 2 would require rather many modifications, so it was too much trouble; I decided to create a TF 1 environment with venv instead:
python3 -m venv ./venv
source ./venv/bin/activate
pip install --upgrade pip
pip install tensorflow==1.15rc2
pip install -r ../requirements.txt
-
tokenize
tok = tokenization_sentencepiece.FullTokenizer(model_file="../model/wiki-ja.model", vocab_file="../model/wiki-ja.vocab")
In [9]: tok.tokenize("本日は晴天なりインターナルサーバーエラー")
Out[9]: ['▁本', '日', 'は', '晴', '天', 'なり', 'インター', 'ナル', 'サーバー', 'エラー']
-
bert_config_file
- --bert_config_file=… but no config.json is found in the repository.
- There is a config.ini.
- The JSON is generated from the ini at the beginning of run_classifier.py (a sketch of such a conversion follows below).
- I decided to save this under the name config.json.
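- A sketch of that ini-to-JSON conversion; the section name and value handling here are assumptions, so adjust them to whatever the repo's config.ini actually contains:
import configparser
import json

ini = configparser.ConfigParser()
ini.read("../config.ini")

# hypothetical section name; configparser returns values as strings,
# so numeric hyperparameters may need int()/float() conversion
bert_config = dict(ini["BERT-CONFIG"])
with open("config.json", "w") as f:
    json.dump(bert_config, f, indent=2)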
-
ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ../model
- Place the downloaded model.ckpt-1400000.* files in ../model.
- Passing --init_checkpoint=../model is not enough; --init_checkpoint=../model/model.ckpt-1400000 is correct.
-
extract_features.py worked.
- The command looks like this
$ python3 extract_features.py --model_file=../model/wiki-ja.model --vocab_file=../model/wiki-ja.vocab --input_file=smallinput.txt --bert_config_file=config.json --init_checkpoint=../model/model.ckpt-1400000 --output_file=tmp/output
- A JSON file is written out under the name given to --output_file.
- Reading it back:
x = json.load(open("tmp/output"))
In [8]: x["features"][2]["token"]
Out[8]: 'correct'
In [10]: x["features"][2]["layers"][0]
Out[10]:
{'index': -1,
'values': [-0.433769, ...]}
- Each token gets a 768-dimensional vector.
- (If you carelessly run this on a large file, you may end up with a very large JSON.)
- Since what I want is a vector for a whole sentence, can I just take the vector of the first token from layer -1? (see the sketch below)
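- A sketch of two candidate sentence vectors computed from the JSON above: (a) the first-token ([CLS]) vector from layer -1, and (b) the bert-as-service-style average of token vectors from the second-to-last layer (this assumes -2 was included in --layers, which is the upstream default). Whether either is a good sentence embedding is exactly the open question discussed earlier.
import json
import numpy as np

x = json.load(open("tmp/output"))

def layer_values(feature, layer_index):
    # pick the layer entry whose "index" field matches, e.g. -1 or -2
    return next(l["values"] for l in feature["layers"] if l["index"] == layer_index)

# (a) vector of the first token ([CLS]) from layer -1
cls_vec = np.array(layer_values(x["features"][0], -1))                    # shape (768,)

# (b) average of all token vectors from the second-to-last layer
mean_vec = np.mean([layer_values(f, -2) for f in x["features"]], axis=0)  # shape (768,)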
-
read_examples
- If the input is separated by |||, it is treated as a pair of two sentences; otherwise it is treated as a single sentence.
- I was thinking that if you want to feed in your own data, it would be better to replace the part of main that calls read_examples, rather than reading from a text file (see the sketch just after this list).
- Because I want to use Scrapbox JSON files as the source data.
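- A sketch of such a replacement, reusing InputExample from extract_features.py; the Scrapbox export layout ("pages" with "title" and "lines") is an assumption here:
import json
from extract_features import InputExample  # assumes the script is importable as a module

def read_scrapbox_examples(path):
    # assumed Scrapbox export layout: {"pages": [{"title": ..., "lines": [...]}, ...]}
    data = json.load(open(path))
    examples = []
    for unique_id, page in enumerate(data["pages"]):
        text_a = " ".join(page["lines"])  # join the page body into a single segment
        examples.append(InputExample(unique_id=unique_id, text_a=text_a, text_b=None))
    return examples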
-
Imported extract_features.py and overwrote main
- 8245 seconds to vectorize 6343 items
- MacBook Pro (15-inch, 2018) / 2.6 GHz Intel Core i7 / 16 GB 2400 MHz DDR4
-
Note on experiment: Link Creation Support.
-
I haven't tried further training or fine-tuning with run_classifier.py yet.
- If I try it, I will write it up in [Japanese BERT fine-tuning].
This page is auto-translated from /nishio/日本語BERTのrun_classifier読解 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.