Class Attention

Introduced by Touvron et al. in Going deeper with Image Transformers

A Class Attention layer, or CA layer, is an attention mechanism for vision transformers, used in CaiT, that extracts information from a set of processed patches. It is identical to a self-attention layer, except that it relies on the attention between (i) the class embedding $x_{\text{class}}$ (initialized at CLS in the first CA layer) and (ii) itself plus the set of frozen patch embeddings $x_{\text{patches}}$.

Considering a network with $h$ heads and $p$ patches, and denoting by $d$ the embedding size, the multi-head class-attention is parameterized with several projection matrices, $W_{q}, W_{k}, W_{v}, W_{o} \in \mathbf{R}^{d \times d}$, and the corresponding biases $b_{q}, b_{k}, b_{v}, b_{o} \in \mathbf{R}^{d}$. With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z=\left[x_{\text{class}}, x_{\text{patches}}\right]$. We then perform the projections:

$$Q=W_{q} x_{\text {class }}+b_{q}$$

$$K=W_{k} z+b_{k}$$

$$V=W_{v} z+b_{v}$$

The class-attention weights are given by

$$A=\operatorname{Softmax}\left(Q K^{T} / \sqrt{d / h}\right)$$

where $Q K^{T} \in \mathbf{R}^{h \times 1 \times p}$. These attention weights are used in the weighted sum $A V$ to produce the residual output vector

$$ \operatorname{out}_{\mathrm{CA}}=W_{o} A V+b_{o} $$

which is in turn added to $x_{\text {class }}$ for subsequent processing.
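The computation above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the CaiT reference implementation: tokens are stored as rows (so $W x + b$ becomes `x @ W.T + b`), the function name `class_attention` and the toy sizes are hypothetical, and any normalization or other components that a full CaiT block may apply around the layer are left out because the formulas here do not include them. Since $z$ contains the class embedding as well as the patches, the score tensor in this sketch has $p + 1$ entries along its last axis.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def class_attention(x_class, x_patches, Wq, Wk, Wv, Wo, bq, bk, bv, bo, h):
    """Hypothetical helper. x_class: (1, d), x_patches: (p, d).

    Returns the updated class embedding (1, d) and the attention
    weights (h, 1, p + 1). The patch embeddings are only read, never updated.
    """
    d = x_class.shape[-1]
    assert d % h == 0, "embedding size must be divisible by the number of heads"
    dh = d // h  # per-head dimension, d / h in the text

    # z = [x_class, x_patches]
    z = np.concatenate([x_class, x_patches], axis=0)  # (p + 1, d)

    # Only the class embedding produces a query; keys and values use all of z.
    Q = x_class @ Wq.T + bq  # (1, d)
    K = z @ Wk.T + bk        # (p + 1, d)
    V = z @ Wv.T + bv        # (p + 1, d)

    # Split the embedding dimension into h heads.
    Qh = Q.reshape(1, h, dh).transpose(1, 0, 2)   # (h, 1, dh)
    Kh = K.reshape(-1, h, dh).transpose(1, 0, 2)  # (h, p + 1, dh)
    Vh = V.reshape(-1, h, dh).transpose(1, 0, 2)  # (h, p + 1, dh)

    # A = Softmax(Q K^T / sqrt(d / h))
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)  # (h, 1, p + 1)
    A = softmax(scores, axis=-1)

    # Weighted sum A V, heads merged back into one vector of size d.
    ctx = (A @ Vh).transpose(1, 0, 2).reshape(1, d)  # (1, d)

    # out_CA = W_o A V + b_o, then the residual connection on x_class.
    out_ca = ctx @ Wo.T + bo
    return x_class + out_ca, A


# Toy usage with random parameters (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
d, h, p = 8, 2, 5
x_class = rng.normal(size=(1, d))
x_patches = rng.normal(size=(p, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
bq, bk, bv, bo = (np.zeros(d) for _ in range(4))

new_class, A = class_attention(
    x_class, x_patches, Wq, Wk, Wv, Wo, bq, bk, bv, bo, h
)
```

Because the query comes from a single token, each head computes one row of attention weights rather than a full token-by-token matrix, so the cost of the layer grows linearly with the number of patches.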

Source: Going deeper with Image Transformers

