Fixed Factorized Attention Explained


**Fixed Factorized Attention** is a factorized attention pattern where specific cells summarize previous locations and propagate that information to all future cells. It was proposed as part of the [Sparse Transformer](https://paperswithcode.com/method/sparse-transformer) architecture.

A self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \text{set}\left(S\_{1}, \dots, S\_{n}\right)$, where $S\_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:

$$ \text{Attend}\left(X, S\right) = \left(a\left(\mathbf{x}\_{i}, S\_{i}\right)\right)\_{i\in\text{set}\left(1,\dots,n\right)}$$

$$ a\left(\mathbf{x}\_{i}, S\_{i}\right) = \text{softmax}\left(\frac{\left(W\_{q}\mathbf{x}\_{i}\right)K^{T}\_{S\_{i}}}{\sqrt{d}}\right)V\_{S\_{i}} $$

$$ K\_{S\_{i}} = \left(W\_{k}\mathbf{x}\_{j}\right)\_{j\in{S\_{i}}} $$

$$ V\_{S\_{i}} = \left(W\_{v}\mathbf{x}\_{j}\right)\_{j\in{S\_{i}}} $$

Here $W\_{q}$, $W\_{k}$, and $W\_{v}$ represent the weight matrices which transform a given $x\_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.
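The definition of $\text{Attend}\left(X, S\right)$ can be sketched in plain Python. This is a minimal illustration rather than the Sparse Transformer implementation: the function names (`attend`, `matvec`, `softmax`), the 0-based indexing, and the convention that weight matrices multiply column vectors from the left are all assumptions made for the example.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(X, S, W_q, W_k, W_v):
    """Compute a(x_i, S_i) for every position i.

    X is a list of n input vectors and S is a list of n index lists
    (0-based here, an assumption of this sketch). Each S_i must be non-empty.
    """
    d = len(W_q)  # inner dimension of the queries and keys
    outputs = []
    for x_i, S_i in zip(X, S):
        q = matvec(W_q, x_i)
        K = [matvec(W_k, X[j]) for j in S_i]  # keys restricted to S_i
        V = [matvec(W_v, X[j]) for j in S_i]  # values restricted to S_i
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out = [sum(w * v[t] for w, v in zip(weights, V)) for t in range(len(V[0]))]
        outputs.append(out)
    return outputs
```

Any connectivity pattern can be passed as `S`, so the same function covers full, strided, and fixed patterns as long as every $S\_{i}$ contains at least one index.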

Full self-attention for autoregressive models defines $S\_{i} = \text{set}\left(j : j \leq i\right)$, allowing every element to attend to all previous positions and its own position.
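As a small illustration, the full autoregressive pattern can be written out explicitly. The helper name `full_causal_pattern` and the 0-based indexing are assumptions of this sketch.

```python
def full_causal_pattern(n):
    """Return S_i = {j : j <= i} for i = 0, ..., n-1 (0-based positions)."""
    return [list(range(i + 1)) for i in range(n)]

pattern = full_causal_pattern(4)
# Total number of (query, key) pairs grows quadratically: n * (n + 1) / 2.
total_pairs = sum(len(S_i) for S_i in pattern)
```

The quadratic growth of `total_pairs` in $n$ is the cost that sparse patterns aim to reduce.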

Factorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A\_{i}^{(m)} \subset \text{set}\left(j : j \leq i\right)$ and lets $S\_{i} = A\_{i}^{(m)}$. The goal of the [Sparse Transformer](https://paperswithcode.com/method/sparse-transformer) was to find efficient choices for the subset $A$.

Formally, for Fixed Factorized Attention, $A^{(1)}\_{i} = \text{set}\left(j : \lfloor j/l \rfloor = \lfloor i/l \rfloor\right)$, where the brackets denote the floor operation and $l$ is the stride, and $A^{(2)}\_{i} = \text{set}\left(j : j \bmod l \in \text{set}\left(t, t+1, \ldots, l\right)\right)$, where $t=l-c$ and $c$ is a hyperparameter. The $i$-th output vector of the attention head attends to all input vectors either from $A^{(1)}\_{i}$ or $A^{(2)}\_{i}$. This pattern can be visualized in the figure to the right.
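The two index sets can be enumerated directly from their definitions. The sketch below assumes 0-based positions, so $j \bmod l$ ranges over $0, \ldots, l-1$ and the summary cells are the last $c$ positions of each block. It also applies the causal restriction $j \leq i$ from the factorized self-attention definition. The function name `fixed_pattern` is hypothetical.

```python
def fixed_pattern(n, l, c):
    """Return (A1, A2) as lists of index lists for positions 0..n-1.

    A1[i]: earlier positions in the same block of length l as i.
    A2[i]: earlier positions whose offset within their block is >= t = l - c.
    """
    t = l - c
    A1 = [[j for j in range(i + 1) if j // l == i // l] for i in range(n)]
    A2 = [[j for j in range(i + 1) if j % l >= t] for i in range(n)]
    return A1, A2

A1, A2 = fixed_pattern(n=8, l=4, c=1)
print(A1[5])  # [4, 5]
print(A2[7])  # [3, 7]
```

Under these assumptions, `A2[i]` is empty for positions that precede the first summary cell, so a real implementation would need to handle that case.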

For example, if the stride is $l = 128$ and $c = 8$, then all future positions greater than 128 can attend to positions 120-128, all positions greater than 256 can attend to positions 248-256, and so forth.
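This example can be checked numerically. The sketch assumes 0-based positions, under which the last $c = 8$ offsets of the first block are 120 through 127, so the ranges match the ones quoted above up to the indexing convention. The helper name `summary_positions` is hypothetical.

```python
def summary_positions(i, l=128, c=8):
    """Positions j <= i whose offset within their block is one of the last c."""
    t = l - c
    return [j for j in range(i + 1) if j % l >= t]

first = summary_positions(200)   # query in the second block
second = summary_positions(300)  # query in the third block
print(first[0], first[-1])       # 120 127
print(second[-8], second[-1])    # 248 255
```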

A fixed-attention pattern with $c = 1$ limits the expressivity of the network significantly, as many representations in the network are only used for one block whereas a small number of locations are used by all blocks. The authors found that choosing $c \in \text{set}\left(8, 16, 32\right)$ for typical values of $l \in \text{set}\left(128, 256\right)$ performs well, although this increases the computational cost of this method by a factor of $c$ in comparison to the [strided attention](https://paperswithcode.com/method/strided-attention).
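To illustrate how the cost scales with $c$, the sketch below counts the (query, key) pairs contributed by the summary cells for two values of $c$. It assumes 0-based positions and counts only the $A^{(2)}$ connections, so it is a rough proxy for compute rather than a measurement. The function name `summary_connections` is hypothetical.

```python
def summary_connections(n, l, c):
    """Count pairs (i, j) with j <= i < n and j a summary cell (j mod l >= l - c)."""
    t = l - c
    # A summary cell at position s is visible to the n - s queries i >= s.
    return sum(n - s for s in range(n) if s % l >= t)

base = summary_connections(n=1024, l=128, c=1)
wide = summary_connections(n=1024, l=128, c=8)
ratio = wide / base  # close to 8 for these settings
```

In this setting the number of summary connections grows roughly in proportion to $c$.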

Additionally, the authors found that when using multiple heads, having them attend to distinct subblocks of length $c$ within the block of size $l$ was preferable to having them attend to the same subblock.
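One way to give each head its own subblock is sketched below. The source does not specify how subblocks are assigned, so the layout here (head 0 takes the last $c$ offsets of the block, head 1 the $c$ offsets before that, and so on) and the function name `head_subblock` are purely illustrative assumptions.

```python
def head_subblock(head, l, c):
    """Return the offsets (j mod l) that a given head treats as summary cells.

    Hypothetical layout: consecutive heads take consecutive subblocks of
    length c, counting back from the end of the block of size l.
    """
    hi = l - head * c
    lo = hi - c
    if lo < 0:
        raise ValueError("too many heads for block size l and subblock length c")
    return range(lo, hi)

offsets = [list(head_subblock(h, l=128, c=8)) for h in range(4)]
```

With this assignment the heads cover disjoint offsets, so different heads summarize different parts of each block.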

Attached collections:

ATTENTION PATTERNS


| Task | Papers | Share |
| --- | --- | --- |
| Language Modelling | 83 | 10.91% |
| Large Language Model | 49 | 6.44% |
| Question Answering | 48 | 6.31% |
| Prompt Engineering | 30 | 3.94% |
| Retrieval | 30 | 3.94% |
| Code Generation | 28 | 3.68% |
| In-Context Learning | 28 | 3.68% |
| Sentence | 23 | 3.02% |
| Benchmarking | 18 | 2.37% |
