Generating Long Sequences with Sparse Transformers

Preprint 2019 · Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Transformers are powerful sequence models, but require time and memory that grow quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O\left(n\sqrt{n}\right)$...
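The abstract's $O\left(n\sqrt{n}\right)$ cost comes from restricting each position to attend to roughly $\sqrt{n}$ others instead of all $n$. Below is a minimal, hedged NumPy sketch of one such factorization in the spirit of the paper's *strided* pattern (a local window plus every stride-th earlier position, with stride near $\sqrt{n}$). The function names and the dense-mask formulation are illustrative assumptions; the paper's actual implementation uses block-sparse GPU kernels and never materializes a dense $n \times n$ score matrix.

```python
# Illustrative sketch of strided sparse attention (not the authors' kernels).
import numpy as np

def strided_sparse_mask(n: int, stride: int) -> np.ndarray:
    """Boolean (n, n) mask: position i may attend to position j if j lies in
    the local window of length `stride` ending at i, or if (i - j) is a
    multiple of `stride`. With stride ~ sqrt(n), each row has O(sqrt(n))
    nonzeros, giving O(n * sqrt(n)) total cost."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i                        # autoregressive: no future positions
    local = (i - j) < stride               # the most recent `stride` positions
    strided = ((i - j) % stride) == 0      # every stride-th earlier position
    return causal & (local | strided)

def masked_attention(q, k, v, mask):
    """Plain softmax attention with disallowed pairs set to -inf.
    Illustrative only: here the dense score matrix is still built, so the
    asymptotic savings would require a sparse kernel in practice."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d, stride = 16, 8, 4                    # stride chosen near sqrt(n) = 4
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = masked_attention(q, k, v, strided_sparse_mask(n, stride))
print(out.shape)                           # (16, 8)
```

The strided pattern suits data with a natural period matching the stride (e.g. image rows); the paper's *fixed* pattern, used for the enwik8 results below, instead routes attention through designated summary positions, which works better for text.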

Task               | Dataset                          | Model                             | Metric                   | Value | Global rank
Image Generation   | CIFAR-10                         | Sparse Transformer 59M (strided)  | NLL (test)               | 2.80  | #1
Audio Generation   | Classical music, 5 s at 12 kHz   | Sparse Transformer 152M (strided) | Bits per byte            | 1.97  | #1
Language Modelling | enwik8                           | Sparse Transformer (fixed)        | Bits per character (BPC) | 0.99  | #5
Language Modelling | enwik8                           | Sparse Transformer (fixed)        | Number of parameters     | 95M   | #1
Image Generation   | ImageNet 64x64                   | Sparse Transformer 152M (strided) | Bits per byte            | 3.44  | #1