Transformers

Sparse Transformer

Introduced by Child et al. in Generating Long Sequences with Sparse Transformers

A Sparse Transformer is a Transformer-based architecture that uses sparse factorizations of the attention matrix to reduce time and memory from $O(n^2)$ to $O(n \sqrt{n})$. Other changes to the Transformer architecture include: (a) a restructured residual block and weight initialization, (b) a set of sparse attention kernels that efficiently compute subsets of the attention matrix, and (c) recomputation of attention weights during the backward pass to reduce memory usage.
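The factorized pattern can be illustrated with a toy attention mask. Below is a minimal sketch (not the paper's implementation) of the strided variant, one of the two factorizations described in the paper: each position attends to the previous $\ell$ positions (a local window) and to every $\ell$-th earlier position, with $\ell \approx \sqrt{n}$. The function name and exact indexing are illustrative assumptions.

```python
import numpy as np

def strided_sparse_mask(n, stride):
    """Boolean causal mask for strided sparse attention (illustrative sketch).

    mask[i, j] is True when query position i may attend to key position j.
    """
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # Local head: attend to the previous `stride` positions (inclusive of i).
        mask[i, max(0, i - stride + 1): i + 1] = True
        # Strided head: attend to every `stride`-th position j <= i.
        mask[i, np.arange(i % stride, i + 1, stride)] = True
    return mask

n, stride = 64, 8  # stride chosen as sqrt(n)
mask = strided_sparse_mask(n, stride)
# Each row has O(sqrt(n)) allowed positions, so the total number of
# attended pairs grows as O(n * sqrt(n)) rather than the dense O(n^2).
print(mask.sum(), "of", n * n, "entries attended")
```

Because only $O(\sqrt{n})$ keys are attended per query, the full attention matrix never needs to be materialized; the paper's custom kernels compute exactly these subsets.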



Tasks


Task Papers Share
Language Modelling 6 5.71%
Decoder 5 4.76%
Language Modeling 4 3.81%
Text Classification 4 3.81%
Diversity 3 2.86%
Object 3 2.86%
Object Detection 3 2.86%
Semantic Segmentation 3 2.86%
Image Restoration 3 2.86%
