A paper reporting that the Transformer, which uses neither RNNs nor CNNs and consists only of attention mechanisms, performs well on translation tasks.

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

Commentary (2017-12): http://deeplearning.hatenablog.com/entry/transformer
- Attention mechanism
- The attention mechanism as a dictionary object
- Additive attention and dot-product attention
- Source-Target Attention and Self-Attention
- Scaled Dot-Product Attention (see the sketch below)
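As a rough illustration of the last item, here is a minimal NumPy sketch of scaled dot-product attention (my own toy example, not code from the paper or the linked commentary). It also reflects the "dictionary object" view: each query is matched against keys, and the result is a weighted sum of the corresponding values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled Dot-Product Attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of values

# Toy example: 2 queries, 3 key-value pairs, dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 4)
```

In self-attention, Q, K, and V all come from the same sequence; in source-target attention, Q comes from the decoder side and K, V from the encoder side.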


This page is auto-translated from /nishio/Transformer using DeepL. If you find something interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.