TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Language Modelling	enwik8	All-attention network (18 layers)	Bit per Character (BPC)	1.01	# 19
Language Modelling	enwik8	All-attention network (18 layers)	Number of params	39M	# 30
Language Modelling	enwik8	All-attention network (36 layers)	Number of params	114M	# 10
Language Modelling	Text8	All-attention network - 36 layers	Bit per Character (BPC)	1.08	# 5
Language Modelling	Text8	All-attention network - 36 layers	Number of params	114M	# 6
Language Modelling	Text8	All-attention network - 18 layers	Bit per Character (BPC)	1.11	# 8
Language Modelling	Text8	All-attention network - 18 layers	Number of params	38M	# 13
Language Modelling	WikiText-103	All-attention network (36 layers)	Validation perplexity	19.7	# 20
Language Modelling	WikiText-103	All-attention network (36 layers)	Test perplexity	20.6	# 44
Language Modelling	WikiText-103	All-attention network (36 layers)	Number of params	133M	# 36

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/augmenting-self-attention-with-persistent/language-modelling-on-text8)](https://paperswithcode.com/sota/language-modelling-on-text8?p=augmenting-self-attention-with-persistent)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/augmenting-self-attention-with-persistent/language-modelling-on-enwiki8)](https://paperswithcode.com/sota/language-modelling-on-enwiki8?p=augmenting-self-attention-with-persistent)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/augmenting-self-attention-with-persistent/language-modelling-on-wikitext-103)](https://paperswithcode.com/sota/language-modelling-on-wikitext-103?p=augmenting-self-attention-with-persistent)`

Augmenting Self-attention with Persistent Memory

2 Jul 2019 · Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, Armand Joulin

Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long-term dependencies and is often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that consists solely of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a role similar to that of the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character-level and word-level language modeling benchmarks.
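
The idea in the abstract can be illustrated with a small NumPy sketch. It assumes the persistent memory takes the form of learned key/value vectors that are appended to the context keys and values before the softmax, so each position attends jointly over its past context and a fixed set of input-independent slots. The function name `all_attention`, the single-head layout, and all parameter names are hypothetical and chosen for illustration. Details such as multiple heads, positional information, and the adaptive masking listed under Methods are omitted, so this should not be read as the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def all_attention(x, w_q, w_k, w_v, persistent_k, persistent_v):
    """Single-head causal attention over context plus persistent vectors.

    x:            (t, d) input sequence
    w_q/w_k/w_v:  (d, d) projection matrices
    persistent_k: (n, d) learned keys, independent of the input (assumed form)
    persistent_v: (n, d) learned values, independent of the input (assumed form)
    Returns the (t, d) output and the (t, t + n) attention weights.
    """
    t, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Scores against the context, masked so position i only sees j <= i.
    ctx_scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((t, t), dtype=bool))
    ctx_scores = np.where(causal, ctx_scores, -np.inf)

    # Scores against the persistent keys; these are visible to every position.
    mem_scores = q @ persistent_k.T / np.sqrt(d)

    # One softmax over the concatenation of context and persistent slots.
    weights = softmax(np.concatenate([ctx_scores, mem_scores], axis=1), axis=-1)
    values = np.concatenate([v, persistent_v], axis=0)
    return weights @ values, weights


# Toy usage with random parameters (sizes are arbitrary).
rng = np.random.default_rng(0)
t, d, n = 5, 8, 4
x = rng.normal(size=(t, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
p_k, p_v = rng.normal(size=(n, d)), rng.normal(size=(n, d))
out, weights = all_attention(x, w_q, w_k, w_v, p_k, p_v)
```

Because the persistent keys and values do not depend on the input, they behave like a position-wise lookup table shared across the sequence, which is one way to see why they could stand in for the feed-forward layer.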

PDF Abstract

Code

lucidrains/x-transformers

4,193

facebookresearch/adaptive-span

606

Tasks

Language Modelling

Translation

Datasets

WikiText-2

WikiText-103

Text8

Results from the Paper

Ranked #5 on Language Modelling on Text8

Methods

AdaGrad • Adam • Adaptive Masking • All-Attention Layer • L1 Regularization
