
Class-Attention in Image Transformers

Introduced by Touvron et al. in Going deeper with Image Transformers

CaiT, or Class-Attention in Image Transformers, is a type of vision transformer with several design alterations to the original ViT. First, a new layer-scaling approach called LayerScale is used: a learnable diagonal matrix is applied to the output of each residual block, initialized close to (but not at) 0, which improves training dynamics. Second, class-attention layers are introduced to the architecture. This yields an architecture in which the transformer layers performing self-attention between patches are explicitly separated from the class-attention layers, which are devoted to extracting the content of the processed patches into a single vector that can be fed to a linear classifier.
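
The two ideas can be sketched in a few lines of PyTorch. The snippet below is a minimal, illustrative sketch rather than the paper's reference implementation: the module names (`LayerScale`, `ClassAttention`), the 1e-4 initialization value, the head count, and the residual wiring in the usage example are assumptions chosen for clarity, and details such as biases, dropout, and normalization are omitted.

```python
import torch
import torch.nn as nn


class LayerScale(nn.Module):
    """Per-channel learnable scaling (a diagonal matrix) applied to the
    output of a residual branch, initialized near zero (assumed 1e-4)."""

    def __init__(self, dim: int, init_value: float = 1e-4):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to multiplying each token by diag(gamma).
        return self.gamma * x


class ClassAttention(nn.Module):
    """Attention in which only the class token forms queries; keys and
    values come from the class token concatenated with the patch tokens,
    so only the class token is updated."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_cls: torch.Tensor, x_patches: torch.Tensor) -> torch.Tensor:
        # x_cls: (B, 1, D) class token, x_patches: (B, N, D) patch tokens
        B, N, D = x_patches.shape
        H = self.num_heads
        z = torch.cat([x_cls, x_patches], dim=1)                     # (B, N+1, D)
        q = self.q(x_cls).reshape(B, 1, H, D // H).transpose(1, 2)   # (B, H, 1, D/H)
        k, v = self.kv(z).reshape(B, N + 1, 2, H, D // H).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale                # (B, H, 1, N+1)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, 1, D)            # updated class token
        return self.proj(out)


# Illustrative usage: a class-attention residual step, scaled by LayerScale.
x_cls = torch.randn(2, 1, 192)
x_patches = torch.randn(2, 196, 192)
ca = ClassAttention(dim=192, num_heads=4)
ls = LayerScale(dim=192)
x_cls = x_cls + ls(ca(x_cls, x_patches))  # only the class token is refined
```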

Source: Going deeper with Image Transformers

Tasks


Task                   Papers   Share
Image Classification   1        100.00%

Categories

Vision Transformers