 Attention Patterns

# Fixed Factorized Attention

Introduced by Child et al. in Generating Long Sequences with Sparse Transformers

Fixed Factorized Attention is a factorized attention pattern where specific cells summarize previous locations and propagate that information to all future cells. It was proposed as part of the Sparse Transformer architecture.

A self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \text{set}\left(S_{1}, \dots, S_{n}\right)$, where $S_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:

$$\text{Attend}\left(X, S\right) = \left(a\left(\mathbf{x}_{i}, S_{i}\right)\right)_{i\in\text{set}\left(1,\dots,n\right)}$$

$$a\left(\mathbf{x}_{i}, S_{i}\right) = \text{softmax}\left(\frac{\left(W_{q}\mathbf{x}_{i}\right)K^{T}_{S_{i}}}{\sqrt{d}}\right)V_{S_{i}}$$

$$K_{Si} = \left(W_{k}\mathbf{x}_{j}\right)_{j\in{S_{i}}}$$

$$V_{Si} = \left(W_{v}\mathbf{x}_{j}\right)_{j\in{S_{i}}}$$

Here $W_{q}$, $W_{k}$, and $W_{v}$ represent the weight matrices which transform a given $x_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.

Full self-attention for autoregressive models defines $S_{i} = \text{set}\left(j : j \leq i\right)$, allowing every element to attend to all previous positions and its own position.

Factorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A_{i}^{(m)} ⊂ \text{set}\left(j : j \leq i\right)$ and lets $S_{i} = A_{i}^{(m)}$. The goal with the Sparse Transformer was to find efficient choices for the subset $A$.

Formally for Fixed Factorized Attention, $A^{(1)}_{i} =${$j : \left(\lfloor{j/l\rfloor}=\lfloor{i/l\rfloor}\right)$}, where the brackets denote the floor operation, and $A^{(2)}_{i} =${$j : j \mod l \in${$t, t+1, \ldots, l$}}, where $t=l-c$ and $c$ is a hyperparameter. The $i$-th output vector of the attention head attends to all input vectors either from $A^{(1)}_{i}$ or $A^{(2)}_{i}$. This pattern can be visualized in the figure to the right.

If the stride is 128 and $c = 8$, then all future positions greater than 128 can attend to positions 120-128, all positions greater than 256 can attend to 248-256, and so forth.

A fixed-attention pattern with $c = 1$ limits the expressivity of the network significantly, as many representations in the network are only used for one block whereas a small number of locations are used by all blocks. The authors found choosing $c \in${$8, 16, 32$} for typical values of $l \in {128, 256}$ performs well, although this increases the computational cost of this method by $c$ in comparison to the strided attention.

Additionally, the authors found that when using multiple heads, having them attend to distinct subblocks of length $c$ within the block of size $l$ was preferable to having them attend to the same subblock.

#### Papers

Paper Code Results Date Stars