# Sparse Transformer

Introduced by Child et al. in Generating Long Sequences with Sparse Transformers

A Sparse Transformer is a Transformer-based architecture that uses sparse factorizations of the attention matrix to reduce the time and memory cost of attention from $O(n^2)$ to $O(n \sqrt{n})$. Other changes to the Transformer architecture include: (a) a restructured residual block and weight initialization, (b) a set of sparse attention kernels that efficiently compute subsets of the attention matrix, and (c) recomputation of attention weights during the backward pass to reduce memory usage.
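To make the factorized-attention idea concrete, the sketch below builds the two boolean masks of the strided pattern described in the paper: one head attends to a local window of the previous `stride` positions, the other to every `stride`-th earlier position. This is an illustrative NumPy sketch (function name and mask construction are my own, not from the paper's code); counting the nonzero entries shows the $O(n \sqrt{n})$ scaling versus the dense causal mask's $O(n^2)$.

```python
import numpy as np

def strided_sparse_masks(n: int, stride: int):
    """Boolean attention masks for the strided factorization sketch.

    Head A (local): each query i attends to keys j with i - stride < j <= i.
    Head B (strided): each query i attends to keys j with (i - j) % stride == 0.
    With stride ~ sqrt(n), each query reaches O(sqrt(n)) keys per head.
    """
    i = np.arange(n)[:, None]  # query positions (rows)
    j = np.arange(n)[None, :]  # key positions (columns)
    causal = j <= i            # autoregressive constraint
    local = causal & (i - j < stride)           # sliding-window head
    strided = causal & ((i - j) % stride == 0)  # fixed-offset head
    return local, strided

n, stride = 64, 8  # stride chosen near sqrt(n)
local, strided = strided_sparse_masks(n, stride)
dense_nnz = np.tril(np.ones((n, n), dtype=bool)).sum()
sparse_nnz = np.logical_or(local, strided).sum()
print(f"dense causal entries: {dense_nnz}, sparse entries: {sparse_nnz}")
```

Running this for `n = 64` shows the union of the two sparse masks touching roughly a third of the entries the dense causal mask does, and the gap widens as `n` grows, since per-query cost is $O(\sqrt{n})$ rather than $O(n)$.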
