Attention mechanism (Attention)
Generalization as of 2018
- There is a query and Keys, where Keys is a collection of multiple keys.
- There is a function F that takes a query and Keys as arguments and returns the intensity of attention for each key.
- The results are then normalized in some way to sum to 1 to obtain the attention intensity (roughly softmax; see Hard attention mechanism).
- The Values are then averaged, weighted by their attention intensity.
- Schematic
- F does not know the number of keys in advance; it does not depend on the shape of Keys.
- I don’t know how to express this in mathematical terms.
- There is a function f that takes one query and one key, and F is [f(query, key) for key in Keys] (see the sketch below).
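A minimal NumPy sketch of this generalization (the names attention and f are illustrative, not from any particular paper): score every key with f, normalize the scores to sum to 1, and take the weighted average of the Values.

```python
# A minimal sketch of the generalization above: score each key with f,
# softmax-normalize the scores, then take the weighted average of the Values.
import numpy as np

def attention(query, keys, values, f):
    scores = np.array([f(query, key) for key in keys])  # one scalar per key
    weights = np.exp(scores - scores.max())             # softmax normalization
    weights /= weights.sum()                            # intensities sum to 1
    return weights @ values                             # weighted average of Values

# The number of keys is not fixed: any length N works.
rng = np.random.default_rng(0)
query = rng.normal(size=4)
keys = rng.normal(size=(5, 4))      # N = 5 keys of dimension 4
values = rng.normal(size=(5, 3))    # one Value per key
print(attention(query, keys, values, f=lambda q, k: q @ k).shape)  # (3,)
```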
Additive attention (2014) [1409.0473 Neural Machine Translation by Jointly Learning to Align and Translate]
- f := Feed-Forward Network (see the sketch below)
- "By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly."
- The hidden state of the RNN is a fixed-length vector, and it is a burden to have to pack the entire sentence's information into it.
- The attention mechanism can retrieve information from data of arbitrary length, which relieves that burden.
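A sketch of an additive-attention score function in the spirit of this paper (the parameter names W_q, W_k, v are illustrative); the returned f can be plugged into the attention() sketch above:

```python
# A sketch of an additive-attention score function: a small feed-forward net,
# roughly v . tanh(W_q @ query + W_k @ key). Parameter names are illustrative.
import numpy as np

def make_additive_f(d_query, d_key, d_hidden, rng):
    W_q = rng.normal(size=(d_hidden, d_query))
    W_k = rng.normal(size=(d_hidden, d_key))
    v = rng.normal(size=d_hidden)
    def f(query, key):
        return v @ np.tanh(W_q @ query + W_k @ key)  # scalar attention score
    return f

f = make_additive_f(d_query=4, d_key=4, d_hidden=8, rng=np.random.default_rng(0))
print(f(np.ones(4), np.ones(4)))  # a single attention score
```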
Dot-product attention (2015) [1508.04025 Effective Approaches to Attention-based Neural Machine Translation https://arxiv.org/abs/1508.04025]
- A variant in which f(query, key) is simply the inner product of the query and the key (sketched below).
- In some papers, of course, this inner product is written as a matrix product.
- Related: bilinear.
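Sketches of the dot-product score and the related bilinear form (W is an illustrative trainable matrix); either can serve as f in the attention() sketch above:

```python
# Sketches of the dot-product score and the related bilinear form.
# Both return a scalar per (query, key) pair; W is an illustrative matrix.
import numpy as np

def dot_product_f(query, key):
    return query @ key                         # simple inner product

def make_bilinear_f(W):
    return lambda query, key: query @ W @ key  # bilinear form q^T W k

q, k = np.ones(4), np.arange(4.0)
print(dot_product_f(q, k), make_bilinear_f(np.eye(4))(q, k))  # 6.0 6.0
```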
- Initially, the attention mechanism was envisioned to be used in combination with RNNs.
- In an Encoder-Decoder configuration, store the Encoder's hidden states and use the attention mechanism to select from among them.
- In this configuration, Key and Value come from Encoder and query comes from Decoder.
- This type of configuration is called [Source Target Attention].
- (See Sequence Generation with Target Attention (2017) for a comparative discussion of source-target attention and target-target attention.)
- K and V together are called Memory (see the sketch below).
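A sketch of this Source-Target Attention configuration, with illustrative shapes and names: the Encoder hidden states form the Memory (used as both Key and Value) and the Decoder hidden state is the query:

```python
# A sketch of Source-Target Attention (illustrative shapes and names):
# the encoder hidden states are the Memory (used as both Key and Value),
# and the current decoder hidden state is the query.
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(7, 16))   # Memory: 7 source positions, dim 16
decoder_state = rng.normal(size=16)         # query from the Decoder

scores = encoder_states @ decoder_state     # dot-product score per source position
weights = np.exp(scores - scores.max())
weights /= weights.sum()                    # attention intensities, sum to 1
context = weights @ encoder_states          # weighted average of the Values
print(context.shape)                        # (16,) context vector for the decoder
```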
- A related term is Self-attention.
- In this one, Key, Value, and query all come from the self... no, that definition is not at the same level of abstraction...
- It may eventually differentiate into better terms.
- So far, one implementation example is “everything comes from the lower layers.”
- In this form, it is a development of the CNN.
- CNNs, which could only accept fixed-length input, were replaced by an attention mechanism that can accept variable-length input (see the sketch below).
- Commentary on this replacement: CNN and self-attention.
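A sketch of self-attention in the "everything comes from the lower layer" form (W_q, W_k, W_v are illustrative projection matrices); unlike a fixed-size CNN window, it accepts sequences of any length:

```python
# A sketch of self-attention: query, Key, and Value are all projections of the
# same lower-layer sequence X, so any sequence length is accepted.
# W_q, W_k, W_v are illustrative trainable projection matrices.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # all derived from X itself
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # one score per position pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V                               # one output vector per position

rng = np.random.default_rng(0)
d = 16
W_q, W_k, W_v = [rng.normal(size=(d, d)) for _ in range(3)]
for length in (5, 9):                                # variable-length input is fine
    print(self_attention(rng.normal(size=(length, d)), W_q, W_k, W_v).shape)
```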
Old commentary
- This commentary implicitly assumes an RNN and is not a generalization.
- Save the past hidden states.
- Create a scalar representing the appropriate intensity of attention from "the current hidden state and each past hidden state."
- Normalize those scalars so they sum to 1.
- Use them as weights for a weighted average of the hidden states (see the sketch below).
- There is a way to use output layer values instead of past hidden states.
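A sketch of this old RNN-centric description (the RNN update and the score function are placeholders, not a real model):

```python
# A sketch of the old RNN-centric description above: keep the past hidden
# states, score each against the current hidden state, normalize to 1,
# and take the weighted average.
import numpy as np

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)                       # current hidden state
past_hidden_states = []                      # saved past hidden states

for step in range(5):
    past_hidden_states.append(h)
    h = np.tanh(rng.normal(size=d))          # placeholder for the RNN update
    scores = np.array([h @ h_past for h_past in past_hidden_states])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # normalized to a total of 1
    context = weights @ np.stack(past_hidden_states)  # weighted average
print(context.shape)                         # (8,)
```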
This page is auto-translated from /nishio/注意機構 using DeepL. If you see something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.