Performer is a Transformer architecture that can approximate regular (softmax) full-rank attention with provable accuracy, while using only linear (as opposed to quadratic) space and time complexity and without relying on priors such as sparsity or low-rankness. Performers are linear-attention architectures fully compatible with regular Transformers and come with strong theoretical guarantees: unbiased or nearly unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. To approximate the softmax attention kernel, Performers use Fast Attention Via positive Orthogonal Random features (FAVOR+), which leverages new methods for approximating softmax and Gaussian kernels.
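The sketch below illustrates the core idea under stated assumptions; it is not the authors' implementation, and the helper names (`orthogonal_gaussian`, `positive_features`, `favor_attention`) are hypothetical. It builds positive orthogonal random features for the softmax kernel and uses them to compute attention in time linear in the sequence length L (O(L·m·d) for m random features) rather than quadratic.

```python
# Minimal NumPy sketch of FAVOR+-style attention (illustrative, not the official code).
import numpy as np

def orthogonal_gaussian(m, d, rng):
    """Draw m feature vectors in R^d with (block-)orthogonal directions,
    rescaled to match the norms of i.i.d. Gaussian rows."""
    blocks = []
    for _ in range(int(np.ceil(m / d))):
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthonormal d x d block
        blocks.append(q)
    w = np.concatenate(blocks, axis=0)[:m]
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1, keepdims=True)
    return w * norms                                       # (m, d)

def positive_features(x, w):
    """Positive random features phi(x) with E[phi(q) . phi(k)] = exp(q . k)."""
    m = w.shape[0]
    proj = x @ w.T                                         # (L, m)
    return np.exp(proj - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

def favor_attention(Q, K, V, m=256, seed=0):
    """Linear-time approximation of softmax attention softmax(Q K^T / sqrt(d)) V."""
    rng = np.random.default_rng(seed)
    d = Q.shape[-1]
    w = orthogonal_gaussian(m, d, rng)
    q_prime = positive_features(Q / d**0.25, w)            # (L, m)
    k_prime = positive_features(K / d**0.25, w)            # (L, m)
    kv = k_prime.T @ V                                      # (m, d_v): never forms L x L matrix
    numer = q_prime @ kv                                    # (L, d_v)
    denom = q_prime @ k_prime.sum(axis=0, keepdims=True).T  # (L, 1) row normalizer
    return numer / denom

# Toy comparison against exact softmax attention.
L, d = 128, 16
rng = np.random.default_rng(1)
Q, K, V = rng.standard_normal((3, L, d))
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
exact = (weights / weights.sum(axis=-1, keepdims=True)) @ V
approx = favor_attention(Q, K, V, m=1024)
print(np.abs(exact - approx).mean())  # mean absolute error shrinks as m grows
```

Because the L x L attention matrix is never materialized, memory and time scale linearly in the sequence length; the orthogonality of the random feature directions reduces the estimator's variance relative to i.i.d. features.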
Source: Rethinking Attention with Performers
Task | Papers | Share |
---|---|---|
Language Modelling | 6 | 3.75% |
Decoder | 5 | 3.13% |
Language Modeling | 4 | 2.50% |
Classification | 4 | 2.50% |
Time Series Analysis | 4 | 2.50% |
Decision Making | 3 | 1.88% |
Anomaly Detection | 3 | 1.88% |
Computational Efficiency | 3 | 1.88% |
Sentiment Analysis | 3 | 1.88% |