Augmenting Self-attention with Persistent Memory

2 Jul 2019 · Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, Armand Joulin

Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long-term dependencies and is often regarded as the key ingredient in the success of Transformers. Building on this intuition, we propose a new model that consists solely of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a role similar to the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of the transformer. Our evaluation shows the benefits brought by our model on standard character-level and word-level language modeling benchmarks.
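
To make the idea concrete, here is a minimal, single-head sketch (not the authors' released code) of a self-attention layer whose keys and values are extended with learned persistent vectors shared across positions, so that no separate feed-forward sublayer is needed. The class name PersistentSelfAttention and the parameter n_persist are illustrative; multi-head attention, adaptive span, and the position embeddings used in the paper are omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersistentSelfAttention(nn.Module):
    """Single-head self-attention augmented with learned persistent key/value vectors."""

    def __init__(self, d_model: int, n_persist: int = 1024):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Persistent memory: learned key/value vectors, independent of the input,
        # playing a role similar to the feed-forward layer (illustrative init scale).
        self.persist_k = nn.Parameter(torch.randn(n_persist, d_model) / math.sqrt(d_model))
        self.persist_v = nn.Parameter(torch.randn(n_persist, d_model) / math.sqrt(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        bsz, seq_len, _ = x.shape
        q = self.q_proj(x)                                     # (B, T, D)
        k = self.k_proj(x)                                     # (B, T, D)
        v = self.v_proj(x)                                     # (B, T, D)
        # Concatenate the persistent vectors to the context keys/values.
        pk = self.persist_k.unsqueeze(0).expand(bsz, -1, -1)   # (B, N, D)
        pv = self.persist_v.unsqueeze(0).expand(bsz, -1, -1)   # (B, N, D)
        k = torch.cat([k, pk], dim=1)                          # (B, T+N, D)
        v = torch.cat([v, pv], dim=1)                          # (B, T+N, D)
        # Attention scores over context positions and persistent slots.
        scores = q @ k.transpose(1, 2) / math.sqrt(self.d_model)  # (B, T, T+N)
        # Causal mask applies only to context positions; persistent slots are always visible.
        causal = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).bool()
        always = torch.ones(seq_len, scores.size(-1) - seq_len, device=x.device).bool()
        mask = torch.cat([causal, always], dim=1)               # (T, T+N)
        scores = scores.masked_fill(~mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return self.out_proj(attn @ v)                          # (B, T, D)
```

In this sketch, stacking such layers (with the usual residual connections and normalization) yields an attention-only network in the spirit of the paper's all-attention model; the persistent slots act as input-independent memory that every query can attend to.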


Evaluation Results from the Paper


TASK                 DATASET       MODEL                                METRIC                    VALUE   GLOBAL RANK
Language Modelling   enwik8        All-attention network (18 layers)    Bit per Character (BPC)   1.01    #6
Language Modelling   enwik8        All-attention network (18 layers)    Number of params          39M     #1
Language Modelling   enwik8        All-attention network (36 layers)    Bit per Character (BPC)   0.98    #4
Language Modelling   enwik8        All-attention network (36 layers)    Number of params          114M    #1
Language Modelling   Text8         All-attention network (36 layers)    Bit per Character (BPC)   1.08    #4
Language Modelling   Text8         All-attention network (36 layers)    Number of params          114M    #1
Language Modelling   Text8         All-attention network (18 layers)    Bit per Character (BPC)   1.11    #5
Language Modelling   Text8         All-attention network (18 layers)    Number of params          38M     #1
Language Modelling   WikiText-103  All-attention network (36 layers)    Validation perplexity     19.7    #8
Language Modelling   WikiText-103  All-attention network (36 layers)    Test perplexity           20.6    #9
Language Modelling   WikiText-103  All-attention network (36 layers)    Number of params          133M    #1