Multi-Head Linear Attention is a type of linear multi-head self-attention module, proposed with the Linformer architecture. The main idea is to add two linear projection matrices $E_{i}, F_{i} \in \mathbb{R}^{k\times{n}}$ when computing the keys and values. We first project the original $\left(n \times d\right)$-dimensional key and value layers $KW_{i}^{K}$ and $VW_{i}^{V}$ into $\left(k\times{d}\right)$-dimensional projected key and value layers. We then compute an $\left(n\times{k}\right)$-dimensional context mapping $\bar{P}$ using scaled dot-product attention:
$$ \bar{\text{head}_{i}} = \text{Attention}\left(QW^{Q}_{i}, E_{i}KW_{i}^{K}, F_{i}VW_{i}^{V}\right) = \underbrace{\text{softmax}\left(\frac{QW^{Q}_{i}\left(E_{i}KW_{i}^{K}\right)^{T}}{\sqrt{d_{k}}}\right)}_{\bar{P}:\, n\times{k}} \cdot F_{i}VW_{i}^{V} $$
Finally, we compute the context embeddings for each head as $\bar{P} \cdot \left(F_{i}VW_{i}^{V}\right)$. Because $\bar{P}$ is only $\left(n\times{k}\right)$-dimensional and the projected values are $\left(k\times{d}\right)$-dimensional, these operations require $O(nk)$ time and memory, which is linear in the sequence length $n$ for a fixed projection size $k$.
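As a concrete illustration, below is a minimal single-head sketch of this computation in PyTorch. The module and argument names (`LinearAttentionHead`, `seq_len`, `proj_dim`) and the random initialization of $E_{i}$ and $F_{i}$ are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of one Linformer-style attention head (assumed PyTorch).
import math
import torch
import torch.nn as nn


class LinearAttentionHead(nn.Module):
    """One head of multi-head linear attention: keys and values are projected
    from sequence length n down to a fixed length k before attention."""

    def __init__(self, d_model: int, d_head: int, seq_len: int, proj_dim: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_head, bias=False)  # W_i^Q
        self.w_k = nn.Linear(d_model, d_head, bias=False)  # W_i^K
        self.w_v = nn.Linear(d_model, d_head, bias=False)  # W_i^V
        # E_i and F_i: (k x n) projections applied along the sequence axis.
        # Random initialization here is an assumption for the sketch.
        self.E = nn.Parameter(torch.randn(proj_dim, seq_len) / math.sqrt(seq_len))
        self.F = nn.Parameter(torch.randn(proj_dim, seq_len) / math.sqrt(seq_len))
        self.scale = 1.0 / math.sqrt(d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model)
        q = self.w_q(x)                       # (batch, n, d_head)
        k = self.E @ self.w_k(x)              # (batch, k, d_head) projected keys
        v = self.F @ self.w_v(x)              # (batch, k, d_head) projected values
        # Context mapping P_bar: (batch, n, k) instead of (batch, n, n).
        p_bar = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return p_bar @ v                      # (batch, n, d_head)


if __name__ == "__main__":
    head = LinearAttentionHead(d_model=64, d_head=16, seq_len=128, proj_dim=32)
    out = head(torch.randn(2, 128, 64))
    print(out.shape)  # torch.Size([2, 128, 16])
```

Stacking several such heads and concatenating their outputs recovers the multi-head form; each head's attention matrix is $n\times{k}$ rather than $n\times{n}$.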
Source: Linformer: Self-Attention with Linear Complexity
Task | Papers | Share |
---|---|---|
Language Modelling | 3 | 7.32% |
Prediction | 2 | 4.88% |
Language Modeling | 2 | 4.88% |
Classification | 2 | 4.88% |
Survey | 2 | 4.88% |
Image Super-Resolution | 1 | 2.44% |
Super-Resolution | 1 | 2.44% |
Mamba | 1 | 2.44% |
Deblurring | 1 | 2.44% |
Component | Type |
---|---|
| Feedforward Networks |
| Attention Mechanisms |