TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Generation	CIFAR-10	Sparse Transformer 59M (strided)	bits/dimension	2.80	# 13
Audio Generation	Classical music, 5 seconds at 12 kHz	Sparse Transformer 152M (strided)	Bits per byte	1.97	# 1
Language Modelling	enwik8	Sparse Transformer (30 layers, fixed attn)	Bit per Character (BPC)	0.99	# 12
Language Modelling	enwik8	Sparse Transformer (30 layers, fixed attn)	Number of params	95M	# 14
Image Generation	ImageNet 64x64	Sparse Transformer 59M (strided)	Bits per dim	3.44	# 6
Question Answering	Natural Questions (long)	Sparse Attention	F1	74.5	# 4
Question Answering	Quasar-T	Sparse Attention	EM	52.1	# 3
Open-Domain Question Answering	SearchQA	Sparse Attention	EM	64.7	# 4

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/190410509/audio-generation-on-classical-music-5-seconds)](https://paperswithcode.com/sota/audio-generation-on-classical-music-5-seconds?p=190410509)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/190410509/question-answering-on-quasart-t)](https://paperswithcode.com/sota/question-answering-on-quasart-t?p=190410509)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/190410509/question-answering-on-natural-questions-long)](https://paperswithcode.com/sota/question-answering-on-natural-questions-long?p=190410509)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/190410509/open-domain-question-answering-on-searchqa)](https://paperswithcode.com/sota/open-domain-question-answering-on-searchqa?p=190410509)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/190410509/image-generation-on-imagenet-64x64)](https://paperswithcode.com/sota/image-generation-on-imagenet-64x64?p=190410509)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/190410509/language-modelling-on-enwiki8)](https://paperswithcode.com/sota/language-modelling-on-enwiki8?p=190410509)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/190410509/image-generation-on-cifar-10)](https://paperswithcode.com/sota/image-generation-on-cifar-10?p=190410509)`

Generating Long Sequences with Sparse Transformers

Preprint 2019 · Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Transformers are powerful sequence models, but require time and memory that grow quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.
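To make the $O(n \sqrt{n})$ claim concrete, the following is a minimal sketch of a strided sparse pattern, assuming a stride of roughly $\sqrt{n}$ and a causal setting where position `i` only looks at positions `j <= i`. The function names `strided_attention_indices` and `count_connections` are illustrative and do not come from the official code. For brevity the sketch merges the local and strided components into a single set per query, which is only one of several ways the two patterns could be combined, and the exact window boundaries are an assumption.

```python
import math


def strided_attention_indices(i, stride):
    """Return the positions j <= i that query position i may attend to.

    Two components are merged here (an illustrative simplification):
      * local:   the most recent `stride` positions, including i itself
      * strided: every earlier position j with (i - j) % stride == 0
    """
    local = set(range(max(0, i - stride + 1), i + 1))
    strided = set(range(i % stride, i + 1, stride))
    return local | strided


def count_connections(n, stride=None):
    """Count query-key pairs kept by the sparse pattern for length n."""
    # Assumption: the stride defaults to floor(sqrt(n)).
    stride = stride or max(1, math.isqrt(n))
    return sum(len(strided_attention_indices(i, stride)) for i in range(n))


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        sparse = count_connections(n)
        dense = n * (n + 1) // 2  # full causal attention
        print(f"n={n}: sparse={sparse}, dense={dense}, ratio={sparse / dense:.3f}")
```

Each query keeps at most about `stride + i / stride` keys, so with a stride near $\sqrt{n}$ the total number of kept pairs grows like $n \sqrt{n}$ rather than $n^2$. The printed ratio shrinks as `n` increases, which is the effect the abstract describes.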


Code

Repository	GitHub stars
openai/sparse_attention (official)	1,480
mistralai/mistral-src	8,601
wilson1yan/VideoGPT	874
ptillet/torch-blocksparse	141
han-shi/SparseBERT

Tasks


Image Generation

Language Modelling

Open-Domain Question Answering

Question Answering

Datasets

CIFAR-10

Natural Questions

SearchQA

QUASAR-T

Results from the Paper


Ranked #1 on Audio Generation on Classical music, 5 seconds at 12 kHz


Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Image Generation	CIFAR-10	Sparse Transformer 59M (strided)	bits/dimension	2.80	# 13
Audio Generation	Classical music, 5 seconds at 12 kHz	Sparse Transformer 152M (strided)	Bits per byte	1.97	# 1
Image Generation	ImageNet 64x64	Sparse Transformer 59M (strided)	Bits per dim	3.44	# 6
Question Answering	Natural Questions (long)	Sparse Attention	F1	74.5	# 4

Results from Other Papers

Task	Dataset	Model	Metric Name	Metric Value	Rank
Language Modelling	enwik8	Sparse Transformer (30 layers, fixed attn)	Bit per Character (BPC)	0.99	# 12
Language Modelling	enwik8	Sparse Transformer (30 layers, fixed attn)	Number of params	95M	# 14
Question Answering	Quasar-T	Sparse Attention	EM	52.1	# 3
Open-Domain Question Answering	SearchQA	Sparse Attention	EM	64.7	# 4

Methods


Adam • Attention Dropout • Cosine Annealing • Dense Connections • Dropout • Fixed Factorized Attention • GELU • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Sparse Transformer • Strided Attention • Weight Decay
