Adaptive Attention Span in Transformers

ACL 2019 · Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, Armand Joulin

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to significantly extend the maximum context size used in Transformers, while maintaining control over their memory footprint and computational time...
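
The mechanism can be pictured as a soft mask over past positions whose reach is a learned parameter per attention head. Below is a minimal PyTorch-style sketch, assuming the soft masking function m_z(x) = clamp((R + z - x) / R, 0, 1) described in the paper, where x is the query-key distance, z is the learned span, and R controls the ramp width; the module name `AdaptiveSpanMask` and parameters such as `ramp_size` are illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn


class AdaptiveSpanMask(nn.Module):
    """Soft mask over relative positions; the span parameter z is learned per head."""

    def __init__(self, n_heads: int, max_span: int, ramp_size: int = 32):
        super().__init__()
        self.max_span = max_span
        self.ramp_size = ramp_size
        # One learnable span fraction per head, initialised to use the full span.
        self.span_ratio = nn.Parameter(torch.ones(n_heads, 1, 1))

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # attn_scores: (batch, n_heads, query_len, span) raw attention logits,
        # where the last dimension indexes past positions from oldest to newest.
        span = attn_scores.size(-1)
        z = self.span_ratio.clamp(0, 1) * self.max_span
        # Distance of each attended position from the current query token.
        distance = torch.arange(
            span - 1, -1, -1, device=attn_scores.device, dtype=attn_scores.dtype
        )
        # m_z(x) = clamp((R + z - x) / R, 0, 1), broadcast over heads.
        mask = ((self.ramp_size + z - distance) / self.ramp_size).clamp(0, 1)
        # Apply the soft mask to the attention weights and renormalise.
        weights = torch.softmax(attn_scores, dim=-1) * mask.unsqueeze(0)
        return weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-8)
```

The paper additionally adds an L1 penalty on the learned spans to the training loss, so heads keep their span short unless a longer context actually reduces the loss; that regularisation term is omitted from this sketch.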


Evaluation results from the paper


| Task | Dataset | Model | Metric name | Metric value | Global rank |
|---|---|---|---|---|---|
| Language Modelling | enwik8 | 24L Transformer + 8K adaptive span | Bit per Character (BPC) | 0.98 | # 4 |
| Language Modelling | enwik8 | 24L Transformer + 8K adaptive span | Number of params | 209M | # 1 |
| Language Modelling | enwik8 | 12L Transformer + 8K adaptive span | Bit per Character (BPC) | 1.02 | # 7 |
| Language Modelling | enwik8 | 12L Transformer + 8K adaptive span | Number of params | 39M | # 1 |
| Language Modelling | Text8 | 12L Transformer + 8K adaptive span | Bit per Character (BPC) | 1.11 | # 5 |
| Language Modelling | Text8 | 12L Transformer + 8K adaptive span | Number of params | 38M | # 1 |
| Language Modelling | Text8 | 24L Transformer + 8K adaptive span | Bit per Character (BPC) | 1.07 | # 3 |
| Language Modelling | Text8 | 24L Transformer + 8K adaptive span | Number of params | 209M | # 1 |