Fixed Factorized Attention is a factorized attention pattern where specific cells summarize previous locations and propagate that information to all future cells. It was proposed as part of the Sparse Transformer architecture.
A selfattention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \text{set}\left(S_{1}, \dots, S_{n}\right)$, where $S_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:
$$ \text{Attend}\left(X, S\right) = \left(a\left(\mathbf{x}_{i}, S_{i}\right)\right)_{i\in\text{set}\left(1,\dots,n\right)}$$
$$ a\left(\mathbf{x}_{i}, S_{i}\right) = \text{softmax}\left(\frac{\left(W_{q}\mathbf{x}_{i}\right)K^{T}_{S_{i}}}{\sqrt{d}}\right)V_{S_{i}} $$
$$ K_{Si} = \left(W_{k}\mathbf{x}_{j}\right)_{j\in{S_{i}}} $$
$$ V_{Si} = \left(W_{v}\mathbf{x}_{j}\right)_{j\in{S_{i}}} $$
Here $W_{q}$, $W_{k}$, and $W_{v}$ represent the weight matrices which transform a given $x_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dotproduct similarity of the keys and queries.
Full selfattention for autoregressive models defines $S_{i} = \text{set}\left(j : j \leq i\right)$, allowing every element to attend to all previous positions and its own position.
Factorized selfattention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A_{i}^{(m)} ⊂ \text{set}\left(j : j \leq i\right)$ and lets $S_{i} = A_{i}^{(m)}$. The goal with the Sparse Transformer was to find efficient choices for the subset $A$.
Formally for Fixed Factorized Attention, $A^{(1)}_{i} = ${$j : \left(\lfloor{j/l\rfloor}=\lfloor{i/l\rfloor}\right)$}, where the brackets denote the floor operation, and $A^{(2)}_{i} = ${$j : j \mod l \in ${$t, t+1, \ldots, l$}}, where $t=lc$ and $c$ is a hyperparameter. The $i$th output vector of the attention head attends to all input vectors either from $A^{(1)}_{i}$ or $A^{(2)}_{i}$. This pattern can be visualized in the figure to the right.
If the stride is 128 and $c = 8$, then all future positions greater than 128 can attend to positions 120128, all positions greater than 256 can attend to 248256, and so forth.
A fixedattention pattern with $c = 1$ limits the expressivity of the network significantly, as many representations in the network are only used for one block whereas a small number of locations are used by all blocks. The authors found choosing $c \in ${$8, 16, 32$} for typical values of $l \in {128, 256}$ performs well, although this increases the computational cost of this method by $c$ in comparison to the strided attention.
Additionally, the authors found that when using multiple heads, having them attend to distinct subblocks of length $c$ within the block of size $l$ was preferable to having them attend to the same subblock.
Source: Generating Long Sequences with Sparse TransformersPaper  Code  Results  Date  Stars 

Task  Papers  Share 

Language Modelling  54  19.64% 
FewShot Learning  26  9.45% 
Question Answering  25  9.09% 
Text Generation  16  5.82% 
Pretrained Language Models  13  4.73% 
Natural Language Inference  11  4.00% 
ZeroShot Learning  10  3.64% 
Natural Language Understanding  9  3.27% 
Semantic Parsing  6  2.18% 
Component  Type 


🤖 No Components Found  You can add them if they exist; e.g. Mask RCNN uses RoIAlign 