TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Abstractive Text Summarization	CNN / Daily Mail	Transformer	ROUGE-1	39.50	# 50
Abstractive Text Summarization	CNN / Daily Mail	Transformer	ROUGE-2	16.06	# 50
Abstractive Text Summarization	CNN / Daily Mail	Transformer	ROUGE-L	36.63	# 45
Text Summarization	GigaWord	Transformer	ROUGE-1	37.57	# 21
Text Summarization	GigaWord	Transformer	ROUGE-2	18.90	# 21
Text Summarization	GigaWord	Transformer	ROUGE-L	34.69	# 23
Machine Translation	IWSLT2014 German-English	Transformer	BLEU score	34.44	# 26
Machine Translation	IWSLT2015 English-German	Transformer	BLEU score	28.50	# 2
Image-guided Story Ending Generation	LSMDC-E	Transformer	BLEU-1	15.35	# 3
Image-guided Story Ending Generation	LSMDC-E	Transformer	BLEU-2	4.49	# 4
Image-guided Story Ending Generation	LSMDC-E	Transformer	BLEU-3	1.82	# 2
Image-guided Story Ending Generation	LSMDC-E	Transformer	BLEU-4	0.76	# 2
Image-guided Story Ending Generation	LSMDC-E	Transformer	METEOR	11.43	# 3
Image-guided Story Ending Generation	LSMDC-E	Transformer	CIDEr	9.32	# 2
Image-guided Story Ending Generation	LSMDC-E	Transformer	ROUGE-L	19.16	# 4
Multimodal Machine Translation	Multi30K	Transformer	BLEU (DE-EN)	29.0	# 2
Natural Language Understanding	PDP60	Subword-level Transformer LM	Accuracy	58.3	# 10
Constituency Parsing	Penn Treebank	Transformer	F1 score	92.7	# 21
Image-guided Story Ending Generation	VIST-E	Transformer	BLEU-1	17.18	# 4
Image-guided Story Ending Generation	VIST-E	Transformer	BLEU-2	6.29	# 3
Image-guided Story Ending Generation	VIST-E	Transformer	BLEU-3	3.07	# 3
Image-guided Story Ending Generation	VIST-E	Transformer	BLEU-4	2.01	# 3
Image-guided Story Ending Generation	VIST-E	Transformer	METEOR	6.91	# 3
Image-guided Story Ending Generation	VIST-E	Transformer	CIDEr	12.75	# 4
Image-guided Story Ending Generation	VIST-E	Transformer	ROUGE-L	18.23	# 4
Coreference Resolution	Winograd Schema Challenge	Subword-level Transformer LM	Accuracy	54.1	# 73
Machine Translation	WMT2014 English-French	Transformer Big	BLEU score	41.0	# 26
Machine Translation	WMT2014 English-French	Transformer Big	Hardware Burden	23G	# 1
Machine Translation	WMT2014 English-French	Transformer Big	Operations per network pass	2300000000.0G	# 1
Machine Translation	WMT2014 English-French	Transformer Base	BLEU score	38.1	# 39
Machine Translation	WMT2014 English-French	Transformer Base	Hardware Burden	23G	# 1
Machine Translation	WMT2014 English-French	Transformer Base	Operations per network pass	330000000.0G	# 1
Machine Translation	WMT2014 English-German	Transformer Base	BLEU score	27.3	# 52
Machine Translation	WMT2014 English-German	Transformer Base	Operations per network pass	330000000.0G	# 1
Machine Translation	WMT2014 English-German	Transformer Big	BLEU score	28.4	# 44
Machine Translation	WMT2014 English-German	Transformer Big	Hardware Burden	871G	# 1
Machine Translation	WMT2014 English-German	Transformer Big	Operations per network pass	2300000000.0G	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/machine-translation-on-iwslt2015-english)](https://paperswithcode.com/sota/machine-translation-on-iwslt2015-english?p=attention-is-all-you-need)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/multimodal-machine-translation-on-multi30k)](https://paperswithcode.com/sota/multimodal-machine-translation-on-multi30k?p=attention-is-all-you-need)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/image-guided-story-ending-generation-on-lsmdc)](https://paperswithcode.com/sota/image-guided-story-ending-generation-on-lsmdc?p=attention-is-all-you-need)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/image-guided-story-ending-generation-on-vist)](https://paperswithcode.com/sota/image-guided-story-ending-generation-on-vist?p=attention-is-all-you-need)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/natural-language-understanding-on-pdp60)](https://paperswithcode.com/sota/natural-language-understanding-on-pdp60?p=attention-is-all-you-need)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/text-summarization-on-gigaword)](https://paperswithcode.com/sota/text-summarization-on-gigaword?p=attention-is-all-you-need)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/constituency-parsing-on-penn-treebank)](https://paperswithcode.com/sota/constituency-parsing-on-penn-treebank?p=attention-is-all-you-need)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/machine-translation-on-iwslt2014-german)](https://paperswithcode.com/sota/machine-translation-on-iwslt2014-german?p=attention-is-all-you-need)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/machine-translation-on-wmt2014-english-french)](https://paperswithcode.com/sota/machine-translation-on-wmt2014-english-french?p=attention-is-all-you-need)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/machine-translation-on-wmt2014-english-german)](https://paperswithcode.com/sota/machine-translation-on-wmt2014-english-german?p=attention-is-all-you-need)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/abstractive-text-summarization-on-cnn-daily)](https://paperswithcode.com/sota/abstractive-text-summarization-on-cnn-daily?p=attention-is-all-you-need)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/attention-is-all-you-need/coreference-resolution-on-winograd-schema)](https://paperswithcode.com/sota/coreference-resolution-on-winograd-schema?p=attention-is-all-you-need)`
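Every badge row above follows the same URL pattern: a shields.io endpoint image that points at the paper's badge path, wrapped in a link to the matching leaderboard. The helper below is a minimal sketch of that pattern. The function name `pwc_badge_markdown` and its parameters are hypothetical, and the pattern is inferred only from the rows listed here, not from any documented API.

```python
def pwc_badge_markdown(benchmark_slug, paper_slug="attention-is-all-you-need"):
    """Build badge Markdown in the same shape as the rows listed above.

    Both slugs are assumed to be the URL fragments visible in those rows,
    e.g. "constituency-parsing-on-penn-treebank".
    """
    # Image URL: shields.io endpoint badge fed by the paper's badge path.
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    # Link target: the leaderboard page, with the paper passed as ?p=.
    leaderboard_url = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({leaderboard_url})"


print(pwc_badge_markdown("constituency-parsing-on-penn-treebank"))
```

Calling it with `constituency-parsing-on-penn-treebank` reproduces the Penn Treebank row character for character.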

Attention Is All You Need

NeurIPS 2017 · Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
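The attention mechanism the abstract refers to is scaled dot-product attention, softmax(Q Kᵀ / √d_k) V, which also appears in the Methods list for this paper. The following is a minimal sketch in dependency-free Python, assuming a single head, no masking, and inputs given as lists of row vectors. The function names are illustrative, and a practical implementation would use an array library and batch the computation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    dim_v = len(values[0])
    output = []
    for q in queries:
        # Dot product of this query with every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Each output row is the attention-weighted average of the value rows.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
        )
    return output


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
# A zero query scores every key equally, so the result is the mean of the values.
print(scaled_dot_product_attention([[0.0, 0.0]], keys, values))
```

Because every query row is processed independently of the others, the loop over `queries` can run in parallel, which is the property the abstract contrasts with recurrent models.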

Code

Repository	Stars
tensorflow/tensor2tensor (official; Colab quickstart)	14,919
huggingface/transformers	125,545
labmlai/annotated_deep_learning_pap… (annotated code at labml.ai)	48,593
facebookresearch/fairseq	29,322
karpathy/minGPT	18,936

See all 567 implementations

Tasks

Abstractive Text Summarization

Coreference Resolution

Decoder

Few-Shot 3D Point Cloud Classification

Image-guided Story Ending Generation

Link Prediction

Machine Translation

Multimodal Machine Translation

Natural Language Understanding

Question Answering

Speech Emotion Recognition

Text Summarization

Translation

Datasets

Penn Treebank

CNN/Daily Mail

WSC

WMT 2014

Multi30K

VIST-E

LSMDC-E

Results from the Paper

Ranked #2 on Multimodal Machine Translation on Multi30K (BLEU (DE-EN) metric)

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Machine Translation	IWSLT2014 German-English	Transformer	BLEU score	34.44	# 26
Machine Translation	IWSLT2015 English-German	Transformer	BLEU score	28.50	# 2
Image-guided Story Ending Generation	LSMDC-E	Transformer	BLEU-1	15.35	# 3
Image-guided Story Ending Generation	LSMDC-E	Transformer	BLEU-2	4.49	# 4
Image-guided Story Ending Generation	LSMDC-E	Transformer	BLEU-3	1.82	# 2
Image-guided Story Ending Generation	LSMDC-E	Transformer	BLEU-4	0.76	# 2
Image-guided Story Ending Generation	LSMDC-E	Transformer	METEOR	11.43	# 3
Image-guided Story Ending Generation	LSMDC-E	Transformer	CIDEr	9.32	# 2
Image-guided Story Ending Generation	LSMDC-E	Transformer	ROUGE-L	19.16	# 4
Multimodal Machine Translation	Multi30K	Transformer	BLEU (DE-EN)	29.0	# 2
Constituency Parsing	Penn Treebank	Transformer	F1 score	92.7	# 21
Image-guided Story Ending Generation	VIST-E	Transformer	BLEU-1	17.18	# 4
Image-guided Story Ending Generation	VIST-E	Transformer	BLEU-2	6.29	# 3
Image-guided Story Ending Generation	VIST-E	Transformer	BLEU-3	3.07	# 3
Image-guided Story Ending Generation	VIST-E	Transformer	BLEU-4	2.01	# 3
Image-guided Story Ending Generation	VIST-E	Transformer	METEOR	6.91	# 3
Image-guided Story Ending Generation	VIST-E	Transformer	CIDEr	12.75	# 4
Image-guided Story Ending Generation	VIST-E	Transformer	ROUGE-L	18.23	# 4
Machine Translation	WMT2014 English-French	Transformer Base	BLEU score	38.1	# 39
Machine Translation	WMT2014 English-French	Transformer Base	Hardware Burden	23G	# 1
Machine Translation	WMT2014 English-French	Transformer Base	Operations per network pass	330000000.0G	# 1
Machine Translation	WMT2014 English-German	Transformer Base	BLEU score	27.3	# 52
Machine Translation	WMT2014 English-German	Transformer Base	Operations per network pass	330000000.0G	# 1
Machine Translation	WMT2014 English-German	Transformer Big	BLEU score	28.4	# 44
Machine Translation	WMT2014 English-German	Transformer Big	Hardware Burden	871G	# 1
Machine Translation	WMT2014 English-German	Transformer Big	Operations per network pass	2300000000.0G	# 1

Results from Other Papers

Task	Dataset	Model	Metric Name	Metric Value	Rank
Abstractive Text Summarization	CNN / Daily Mail	Transformer	ROUGE-1	39.50	# 50
Abstractive Text Summarization	CNN / Daily Mail	Transformer	ROUGE-2	16.06	# 50
Abstractive Text Summarization	CNN / Daily Mail	Transformer	ROUGE-L	36.63	# 45
Text Summarization	GigaWord	Transformer	ROUGE-1	37.57	# 21
Text Summarization	GigaWord	Transformer	ROUGE-2	18.90	# 21
Text Summarization	GigaWord	Transformer	ROUGE-L	34.69	# 23
Natural Language Understanding	PDP60	Subword-level Transformer LM	Accuracy	58.3	# 10
Coreference Resolution	Winograd Schema Challenge	Subword-level Transformer LM	Accuracy	54.1	# 73
Machine Translation	WMT2014 English-French	Transformer Big	BLEU score	41.0	# 26
Machine Translation	WMT2014 English-French	Transformer Big	Hardware Burden	23G	# 1
Machine Translation	WMT2014 English-French	Transformer Big	Operations per network pass	2300000000.0G	# 1

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • ReLU • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
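Of the methods listed, the absolute position encodings are easy to show concretely. The paper's fixed sinusoidal variant sets PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). The sketch below builds that table in dependency-free Python. The function name and the list-of-lists layout are illustrative choices, not taken from any of the listed implementations.

```python
import math


def sinusoidal_position_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
            angle = pos / (10000 ** ((2 * (dim // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


encodings = sinusoidal_position_encoding(num_positions=4, d_model=8)
# Position 0 has angle 0 everywhere, so its row alternates sin(0)=0 and cos(0)=1.
print(encodings[0])
```

Each row is added to the token embedding at the same position, which is how a model without recurrence or convolution receives word-order information.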
