A Class Attention layer, or CA layer, is an attention mechanism for vision transformers, used in CaiT, that aims to extract information from a set of processed patches. It is identical to a self-attention layer, except that it relies on the attention between (i) the class embedding $x_{\text{class}}$ (initialized to the CLS token before the first CA layer) and (ii) itself plus the set of frozen patch embeddings $x_{\text{patches}}$.
Consider a network with $h$ heads and $p$ patches, and denote by $d$ the embedding size. Multi-head class attention is parameterized by several projection matrices, $W_{q}, W_{k}, W_{v}, W_{o} \in \mathbf{R}^{d \times d}$, and the corresponding biases $b_{q}, b_{k}, b_{v}, b_{o} \in \mathbf{R}^{d}$. With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z=\left[x_{\text{class}}, x_{\text{patches}}\right]$. We then perform the projections:
$$Q=W_{q} x_{\text {class }}+b_{q}$$
$$K=W_{k} z+b_{k}$$
$$V=W_{v} z+b_{v}$$
The class-attention weights are given by
$$ A=\operatorname{Softmax}\left(Q K^{T} / \sqrt{d / h}\right) $$
where $Q K^{T} \in \mathbf{R}^{h \times 1 \times p}$. This attention is involved in the weighted sum $A \times V$ to produce the residual output vector
$$ \operatorname{out}_{\mathrm{CA}}=W_{o} A V+b_{o} $$
which is in turn added to $x_{\text {class }}$ for subsequent processing.
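The block above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: names, shapes (`x_class` as a `(1, d)` row, `x_patches` as `(p, d)`), and the head-splitting layout are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def class_attention(x_class, x_patches, Wq, bq, Wk, bk, Wv, bv, Wo, bo, h):
    """One CA residual block (illustrative sketch).

    x_class:   (1, d) class embedding
    x_patches: (p, d) frozen patch embeddings
    W*, b*:    (d, d) projections and (d,) biases, h = number of heads
    """
    d = x_class.shape[-1]
    dh = d // h                                   # per-head dimension d/h
    z = np.concatenate([x_class, x_patches], 0)   # z = [x_class, x_patches]

    # Q is projected from the class token only; K, V from all of z.
    Q = x_class @ Wq.T + bq                       # (1, d)
    K = z @ Wk.T + bk                             # (1+p, d)
    V = z @ Wv.T + bv                             # (1+p, d)

    # Split into h heads: (tokens, d) -> (h, tokens, d/h).
    Qh = Q.reshape(-1, h, dh).transpose(1, 0, 2)
    Kh = K.reshape(-1, h, dh).transpose(1, 0, 2)
    Vh = V.reshape(-1, h, dh).transpose(1, 0, 2)

    # A = Softmax(Q K^T / sqrt(d/h)), one (1 x tokens) row per head.
    A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)

    # Weighted sum A V, concatenate heads, apply the output projection.
    out = (A @ Vh).transpose(1, 0, 2).reshape(1, d)
    out_ca = out @ Wo.T + bo

    return x_class + out_ca                       # residual connection
```

Note that only the class token produces a query, so the attention map has a single row per head; this is what makes CA cheaper than full self-attention over all patches.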
Source: Going deeper with Image Transformers
Task | Papers | Share
---|---|---
Semantic Segmentation | 3 | 8.33%
Image Classification | 3 | 8.33%
Natural Language Understanding | 2 | 5.56%
Efficient ViTs | 2 | 5.56%
Object Detection | 2 | 5.56%
Computed Tomography (CT) | 1 | 2.78%
Model Compression | 1 | 2.78%
Automatic Speech Recognition (ASR) | 1 | 2.78%
Language Modelling | 1 | 2.78%