A Sparse Transformer is a Transformer-based architecture which utilises sparse factorizations of the attention matrix to reduce the time and memory cost of attention from $O(n^2)$ to $O(n \sqrt{n})$. Other changes to the Transformer architecture include: (a) a restructured residual block and weight initialization, (b) a set of sparse attention kernels which efficiently compute subsets of the attention matrix, and (c) recomputation of attention weights during the backwards pass to reduce memory usage.
Source: Generating Long Sequences with Sparse Transformers
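To make the factorization concrete, below is a minimal PyTorch sketch of the "strided" sparse attention pattern described in the paper, combined with recomputation of the attention weights via gradient checkpointing. This is an illustrative dense reference, not the paper's fused GPU kernels; the function names (`strided_sparse_mask`, `attention`) and the use of `torch.utils.checkpoint` are assumptions chosen to show the idea, not the released implementation.

```python
import math
import torch
from torch.utils.checkpoint import checkpoint

def strided_sparse_mask(n, stride):
    """Boolean mask for the 'strided' factorized pattern: each position
    attends to the previous `stride` positions and to every stride-th
    earlier position. With stride ~ sqrt(n), each row keeps O(sqrt(n))
    entries, giving O(n * sqrt(n)) total work instead of O(n^2)."""
    i = torch.arange(n).unsqueeze(1)  # query positions
    j = torch.arange(n).unsqueeze(0)  # key positions
    causal = j <= i
    local = (i - j) < stride                 # recent positions
    strided = (i - j) % stride == 0          # every stride-th position
    return causal & (local | strided)

def attention(q, k, v, mask):
    # Dense reference implementation; a real sparse kernel would only
    # compute the unmasked entries of the attention matrix.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

n, d = 64, 16
stride = int(math.sqrt(n))  # stride ~ sqrt(n) yields the O(n sqrt(n)) bound
q, k, v = (torch.randn(n, d, requires_grad=True) for _ in range(3))
mask = strided_sparse_mask(n, stride)

# Recompute attention weights on the backward pass instead of storing the
# n x n matrix; gradient checkpointing is one standard way to express the
# paper's recomputation strategy (change (c) above).
out = checkpoint(attention, q, k, v, mask, use_reentrant=False)
out.sum().backward()

print(f"{mask.sum().item()} of {n * n} attention entries kept")
```

Note that the paper defines a second factorized pattern ("fixed") in addition to the strided one sketched here; the strided variant is the one typically illustrated for data with periodic structure such as images.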
Tasks in which the method is used, by paper count:

Task | Papers | Share |
---|---|---|
Language Modelling | 6 | 5.71% |
Decoder | 5 | 4.76% |
Language Modeling | 4 | 3.81% |
Text Classification | 4 | 3.81% |
Diversity | 3 | 2.86% |
Object | 3 | 2.86% |
Object Detection | 3 | 2.86% |
Semantic Segmentation | 3 | 2.86% |
Image Restoration | 3 | 2.86% |