A Class Attention layer, or CA layer, is an attention mechanism for vision transformers, used in CaiT, that aims to extract information from a set of processed patches. It is identical to a self-attention layer, except that it relies on the attention between (i) the class embedding $x_{\text{class}}$ (initialized to the CLS token before the first CA layer) and (ii) itself plus the set of frozen patch embeddings $x_{\text{patches}}$.
Consider a network with $h$ heads and $p$ patches, and denote by $d$ the embedding size. Multi-head class attention is parameterized by several projection matrices, $W_{q}, W_{k}, W_{v}, W_{o} \in \mathbf{R}^{d \times d}$, and the corresponding biases $b_{q}, b_{k}, b_{v}, b_{o} \in \mathbf{R}^{d}$. With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z=\left[x_{\text{class}}, x_{\text{patches}}\right]$. We then perform the projections:
$$Q=W_{q} x_{\text {class }}+b_{q}$$
$$K=W_{k} z+b_{k}$$
$$V=W_{v} z+b_{v}$$
The class-attention weights are given by
$$ A=\operatorname{Softmax}\left(Q K^{T} / \sqrt{d / h}\right) $$
where $Q K^{T} \in \mathbf{R}^{h \times 1 \times p}$. This attention is involved in the weighted sum $A \times V$ to produce the residual output vector
$$ \operatorname{out}_{\mathrm{CA}}=W_{o} A V+b_{o} $$
which is in turn added to $x_{\text {class }}$ for subsequent processing.
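The block above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: names, shapes (`x_class` as a `(1, d)` row, `x_patches` as `(p, d)`), and the head-splitting layout are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def class_attention(x_class, x_patches, Wq, bq, Wk, bk, Wv, bv, Wo, bo, h):
    """One CA residual block (illustrative sketch).

    x_class:   (1, d) class embedding
    x_patches: (p, d) frozen patch embeddings
    W*, b*:    (d, d) projections and (d,) biases, h = number of heads
    """
    d = x_class.shape[-1]
    dh = d // h                                   # per-head dimension d/h
    z = np.concatenate([x_class, x_patches], 0)   # z = [x_class, x_patches]

    # Q is projected from the class token only; K, V from all of z.
    Q = x_class @ Wq.T + bq                       # (1, d)
    K = z @ Wk.T + bk                             # (1+p, d)
    V = z @ Wv.T + bv                             # (1+p, d)

    # Split into h heads: (tokens, d) -> (h, tokens, d/h).
    Qh = Q.reshape(-1, h, dh).transpose(1, 0, 2)
    Kh = K.reshape(-1, h, dh).transpose(1, 0, 2)
    Vh = V.reshape(-1, h, dh).transpose(1, 0, 2)

    # A = Softmax(Q K^T / sqrt(d/h)), one (1 x tokens) row per head.
    A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)

    # Weighted sum A V, concatenate heads, apply the output projection.
    out = (A @ Vh).transpose(1, 0, 2).reshape(1, d)
    out_ca = out @ Wo.T + bo

    return x_class + out_ca                       # residual connection
```

Note that only the class token produces a query, so the attention map has a single row per head; this is what makes CA cheaper than full self-attention over all patches.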
Source: Going deeper with Image Transformers
Task | Papers | Share
---|---|---
Semantic Segmentation | 3 | 8.33%
Image Classification | 3 | 8.33%
Natural Language Understanding | 2 | 5.56%
Efficient ViTs | 2 | 5.56%
Object Detection | 2 | 5.56%
Computed Tomography (CT) | 1 | 2.78%
Model Compression | 1 | 2.78%
Automatic Speech Recognition (ASR) | 1 | 2.78%
Language Modelling | 1 | 2.78%