# Sparse Transformer

Introduced by Child et al. in Generating Long Sequences with Sparse Transformers

A Sparse Transformer is a Transformer-based architecture that uses sparse factorizations of the attention matrix to reduce the time and memory cost of attention from $O(n^2)$ to $O(n \sqrt{n})$. Other changes to the Transformer architecture include: (a) a restructured residual block and weight initialization, (b) a set of sparse attention kernels that efficiently compute subsets of the attention matrix, and (c) recomputation of attention weights during the backward pass to reduce memory usage.
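To make the factorized-attention idea concrete, the sketch below builds the two boolean masks of the strided pattern described in the paper: one head attends to a local window of the previous `stride` positions, the other to every `stride`-th earlier position. This is an illustrative NumPy sketch (function name and mask construction are my own, not from the paper's code); counting the nonzero entries shows the $O(n \sqrt{n})$ scaling versus the dense causal mask's $O(n^2)$.

```python
import numpy as np

def strided_sparse_masks(n: int, stride: int):
    """Boolean attention masks for the strided factorization sketch.

    Head A (local): each query i attends to keys j with i - stride < j <= i.
    Head B (strided): each query i attends to keys j with (i - j) % stride == 0.
    With stride ~ sqrt(n), each query reaches O(sqrt(n)) keys per head.
    """
    i = np.arange(n)[:, None]  # query positions (rows)
    j = np.arange(n)[None, :]  # key positions (columns)
    causal = j <= i            # autoregressive constraint
    local = causal & (i - j < stride)           # sliding-window head
    strided = causal & ((i - j) % stride == 0)  # fixed-offset head
    return local, strided

n, stride = 64, 8  # stride chosen near sqrt(n)
local, strided = strided_sparse_masks(n, stride)
dense_nnz = np.tril(np.ones((n, n), dtype=bool)).sum()
sparse_nnz = np.logical_or(local, strided).sum()
print(f"dense causal entries: {dense_nnz}, sparse entries: {sparse_nnz}")
```

Running this for `n = 64` shows the union of the two sparse masks touching roughly a third of the entries the dense causal mask does, and the gap widens as `n` grows, since per-query cost is $O(\sqrt{n})$ rather than $O(n)$.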
