Generating Long Sequences with Sparse Transformers

Preprint 2019 · Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O\left(n\sqrt{n}\right)$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, and set a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.
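The two factorized patterns evaluated in the results below are "strided" and "fixed" attention. As a rough illustration only (the paper implements these patterns as fused GPU kernels, not dense masks), the sketch below builds the boolean attention masks at the pattern level; the helper names `strided_mask` and `fixed_mask` and the default `c = 1` are assumptions for this example. With stride ≈ √n, each query attends to O(√n) keys, which is where the O(n√n) cost comes from.

```python
import numpy as np

def strided_mask(n: int, stride: int) -> np.ndarray:
    """Causal mask for the "strided" pattern (illustrative, not the paper's kernel).

    Position i attends to the previous `stride` positions (local part) and to
    every earlier position j with (i - j) % stride == 0 (strided part).
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride            # last `stride` positions
    strided = (i - j) % stride == 0     # one position per stride step
    return causal & (local | strided)

def fixed_mask(n: int, stride: int, c: int = 1) -> np.ndarray:
    """Causal mask for the "fixed" pattern (illustrative).

    Position i attends within its own block of `stride` positions, plus to the
    last `c` "summary" positions of every block (c = 1 here for readability).
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    same_block = (i // stride) == (j // stride)
    summary = (j % stride) >= (stride - c)
    return causal & (same_block | summary)

# With stride ~ sqrt(n), each row of either mask has O(sqrt(n)) True entries,
# so attention over all n rows costs O(n * sqrt(n)) instead of O(n^2).
print(strided_mask(16, 4).astype(int))
print(fixed_mask(16, 4).astype(int))
```

In the results table below, "(strided)" and "(fixed)" indicate which of these two attention patterns each model was trained with.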


Results from the Paper


SOTA for Image Generation on CIFAR-10 (NLL Test metric)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Generation | CIFAR-10 | Sparse Transformer 59M (strided) | NLL Test | 2.80 | #1 |
| Audio Generation | Classical music, 5 seconds at 12 kHz | Sparse Transformer 152M (strided) | Bits per byte | 1.97 | #1 |
| Language Modelling | enwik8 | Sparse Transformer (fixed) | Bit per Character (BPC) | 0.99 | #7 |
| Language Modelling | enwik8 | Sparse Transformer (fixed) | Number of params | 95M | #1 |
| Image Generation | ImageNet 64x64 | Sparse Transformer 152M (strided) | Bits per byte | 3.44 | #1 |