TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Document Summarization	CNN / Daily Mail	Synthesizer (R+V)	ROUGE-1	38.57	# 23
Document Summarization	CNN / Daily Mail	Synthesizer (R+V)	ROUGE-2	16.24	# 21
Document Summarization	CNN / Daily Mail	Synthesizer (R+V)	ROUGE-L	35.95	# 23
Linguistic Acceptability	CoLA Dev	Synthesizer (R+V)	Accuracy	53.3	# 5
Semantic Textual Similarity	MRPC Dev	Synthesizer (R+V)	Accuracy	91.2	# 1
Dialogue Generation	Persona-Chat	Synthesizer (R+V)	BLEU-1	14.7	# 1
Dialogue Generation	Persona-Chat	Synthesizer (R+V)	ROUGE-L	14.79	# 1
Dialogue Generation	Persona-Chat	Synthesizer (R+V)	METEOR	6.39	# 1
Dialogue Generation	Persona-Chat	Synthesizer (R+V)	CIDEr	19.09	# 1
Machine Translation	WMT2014 English-French	Synthesizer (Random + Vanilla)	BLEU score	41.85	# 20
Machine Translation	WMT2014 English-German	Synthesizer (Random + Vanilla)	BLEU score	28.47	# 43


Synthesizer: Rethinking Self-Attention in Transformer Models

2 May 2020 · Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. When composed with dot product attention, we find that Synthesizers consistently outperform Transformers. Moreover, we conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that simple Random Synthesizer is not only 60% faster but also improves perplexity by a relative 3.5%. Finally, we show that simple factorized Synthesizers can outperform Linformers on encoding-only tasks.
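The idea of "random alignment matrices" in the abstract can be illustrated with a minimal sketch of a single Random Synthesizer head. This is not the authors' implementation: it uses plain Python lists, no training loop, and the names `random_synthesizer_attention`, `r` and `w_v` are hypothetical. It assumes the head computes softmax(R) applied to a linear value projection of the input, where R is a sequence-length by sequence-length matrix that does not depend on the tokens.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def random_synthesizer_attention(x, r, w_v):
    """Hypothetical single head: softmax(R) @ (X @ W_v).

    x:   (seq_len x d_model) token representations
    r:   (seq_len x seq_len) alignment matrix, independent of x (assumption)
    w_v: (d_model x d_head) value projection
    """
    weights = [softmax(row) for row in r]  # no query-key dot product involved
    values = matmul(x, w_v)
    return matmul(weights, values), weights


def rand_matrix(rng, rows, cols):
    """Illustrative initialiser: i.i.d. standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, d_head = 3, 4, 2
r = rand_matrix(rng, seq_len, seq_len)
w_v = rand_matrix(rng, d_model, d_head)

# Two different inputs share the same attention weights, because R ignores x.
out_a, weights_a = random_synthesizer_attention(rand_matrix(rng, seq_len, d_model), r, w_v)
out_b, weights_b = random_synthesizer_attention(rand_matrix(rng, seq_len, d_model), r, w_v)
```

In this sketch `weights_a` and `weights_b` are identical while `out_a` and `out_b` differ, which is the sense in which the attention weights are learned (or fixed) "without token-token interactions". Whether R is trained or kept frozen is a configuration choice that the sketch leaves open.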


Code


10-zin/Synthesizer

Tasks


Abstractive Text Summarization

Dialogue Generation

Document Summarization

Language Modelling

Linguistic Acceptability

Machine Translation

Semantic Textual Similarity

Text Generation

Translation

Datasets

MRPC

CoLA

CNN/Daily Mail

WMT 2014

PERSONA-CHAT

Results from the Paper


Ranked #1 on Dialogue Generation on Persona-Chat (BLEU-1 metric, using extra training data)


Methods


Absolute Position Encodings • Adam • BPE • Dense Connections • Dense Synthesized Attention • Dropout • Factorized Dense Synthesized Attention • Factorized Random Synthesized Attention • Feedforward Network • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Random Synthesized Attention • ReLU • Residual Connection • Scaled Dot-Product Attention • Softmax • Synthesizer • Transformer
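The "Random + Vanilla" (R+V) model name in the results combines two of the methods listed above, Random Synthesized Attention and Scaled Dot-Product Attention. A rough sketch of one way to compose them follows, assuming the two score matrices are mixed with weights that sum to one before a single softmax. The function name `mixed_attention`, the `mix_logits` parameter, and the choice to normalise the mixing weights with a softmax are illustrative assumptions, not details confirmed by this page.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def mixed_attention(x, r, w_q, w_k, w_v, mix_logits):
    """Hypothetical R+V head.

    Computes softmax(a_r * R + a_v * Q K^T / sqrt(d_head)) @ V,
    where (a_r, a_v) = softmax(mix_logits) so that a_r + a_v = 1.
    """
    a_r, a_v = softmax(mix_logits)  # assumed parameterisation of the mixture
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    dot_scores = matmul(q, k_t)
    mixed = [
        [a_r * r_ij + a_v * s_ij / scale for r_ij, s_ij in zip(r_row, s_row)]
        for r_row, s_row in zip(r, dot_scores)
    ]
    weights = [softmax(row) for row in mixed]
    return matmul(weights, v), weights
```

With `mix_logits` strongly favouring the first entry the head behaves like a pure Random Synthesizer, and with the second entry favoured it reduces to ordinary scaled dot-product attention, so the mixture can interpolate between the two.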
