Attention Modules

Attention Free Transformer

Introduced by Zhai et al. in An Attention Free Transformer

Attention Free Transformer, or AFT, is an efficient variant of a multi-head attention module that eschews dot-product self-attention. In an AFT layer, the key and value are first combined with a set of learned position biases, and the result is multiplied with the query in an element-wise fashion. This operation has memory complexity linear in both the context size and the feature dimension, making it compatible with both large inputs and large model sizes.

Given the input $X$, AFT first linearly transforms it into $Q=X W^{Q}, K=X W^{K}, V=X W^{V}$, then performs the following operation:

$$ Y=f(X) ; Y_{t}=\sigma_{q}\left(Q_{t}\right) \odot \frac{\sum_{t^{\prime}=1}^{T} \exp \left(K_{t^{\prime}}+w_{t, t^{\prime}}\right) \odot V_{t^{\prime}}}{\sum_{t^{\prime}=1}^{T} \exp \left(K_{t^{\prime}}+w_{t, t^{\prime}}\right)} $$

where $\odot$ is the element-wise product; $\sigma_{q}$ is the nonlinearity applied to the query, with the default being sigmoid; and $w \in \mathbb{R}^{T \times T}$ are the learned pair-wise position biases.
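
As a concrete illustration, here is a minimal NumPy sketch of the operation above for a single unbatched sequence. The function name `aft_full` and the parameter shapes are assumptions for this example, not the authors' released code; a real implementation would also subtract a running max before the exponentials for numerical stability.

```python
import numpy as np

def aft_full(X, W_q, W_k, W_v, w):
    """AFT operation for one unbatched sequence of length T.

    X: (T, d_in) inputs; W_q, W_k, W_v: (d_in, d) projections;
    w: (T, T) learned pair-wise position biases.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v    # (T, d) each
    exp_K = np.exp(K)                      # (T, d)
    exp_w = np.exp(w)                      # (T, T)
    # exp(K_{t'} + w_{t,t'}) = exp(w_{t,t'}) * exp(K_{t'}) factorizes, so the
    # weighted average reduces to two matrix products; no (T, T, d) tensor
    # is ever materialized.
    num = exp_w @ (exp_K * V)              # (T, d) weighted sum of values
    den = exp_w @ exp_K                    # (T, d) normalizer
    sigma_q = 1.0 / (1.0 + np.exp(-Q))     # sigmoid nonlinearity on the query
    return sigma_q * (num / den)           # element-wise combine, (T, d)
```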

Explained in words: for each target position $t$, AFT performs a weighted average of the values, and the result is combined with the query via element-wise multiplication. In particular, the weights are composed simply of the keys and a set of learned pair-wise position biases. This gives the immediate advantage of not needing to compute and store the expensive attention matrix, while still maintaining the global interactions between queries and values that MHA provides.
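
A quick usage example of the sketch above, under the same assumed shapes. The only $T \times T$ array involved is the shared bias parameter $w$; no per-example attention matrix is stored.

```python
rng = np.random.default_rng(0)
T, d_in, d = 16, 32, 64
X = rng.normal(size=(T, d_in))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_in, d)) for _ in range(3))
w = np.zeros((T, T))                       # position biases, zero-initialized
Y = aft_full(X, W_q, W_k, W_v, w)
print(Y.shape)                             # (16, 64): one output per position
```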

Source: An Attention Free Transformer

Tasks

Task Papers Share
Image Classification 1 33.33%
Language Modelling 1 33.33%
Fine-Grained Image Classification 1 33.33%
