Attention Is All You Need

NeurIPS 2017 · Ashish Vaswani · Noam Shazeer · Niki Parmar · Jakob Uszkoreit · Llion Jones · Aidan N. Gomez · Łukasz Kaiser · Illia Polosukhin

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
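The attention mechanism at the core of the architecture is the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch (the function name, random inputs, and shapes here are illustrative, not from this page):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity scores between queries and keys, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a convex combination of the value vectors
    return weights @ V

# Toy example: 3 positions, model dimension 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per query position
```

The 1/√d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients; the full model stacks multi-head versions of this operation in both encoder and decoder.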


Evaluation


| Task | Dataset | Model | Metric | Value | Global rank |
|---|---|---|---|---|---|
| Machine Translation | IWSLT2015 English-German | Transformer | BLEU score | 28.23 | #1 |
| Machine Translation | IWSLT2015 German-English | Transformer | BLEU score | 34.44 | #1 |
| Constituency Parsing | Penn Treebank | Transformer | F1 score | 92.7 | #6 |
| Machine Translation | WMT2014 English-French | Transformer Base | BLEU score | 38.1 | #15 |
| Machine Translation | WMT2014 English-French | Transformer Big | BLEU score | 41.0 | #9 |
| Machine Translation | WMT2014 English-German | Transformer Base | BLEU score | 27.3 | #14 |
| Machine Translation | WMT2014 English-German | Transformer Big | BLEU score | 28.4 | #12 |