
Class-Attention in Image Transformers

Introduced by Touvron et al. in Going deeper with Image Transformers

CaiT, or Class-Attention in Image Transformers, is a type of vision transformer with several design alterations to the original ViT. First, a new layer-scaling approach called LayerScale is used: a learnable diagonal matrix is applied to the output of each residual block, initialized close to (but not at) 0, which improves training dynamics. Second, class-attention layers are introduced to the architecture. This yields an architecture in which the transformer layers performing self-attention between patches are explicitly separated from the class-attention layers, which are devoted to extracting the content of the processed patches into a single vector that can be fed to a linear classifier.
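
The two ideas can be sketched in a few lines of PyTorch. The snippet below is a minimal, illustrative sketch rather than the paper's reference implementation: the module names (`LayerScale`, `ClassAttention`), the 1e-4 initialization value, the head count, and the residual wiring in the usage example are assumptions chosen for clarity, and details such as biases, dropout, and normalization are omitted.

```python
import torch
import torch.nn as nn


class LayerScale(nn.Module):
    """Per-channel learnable scaling (a diagonal matrix) applied to the
    output of a residual branch, initialized near zero (assumed 1e-4)."""

    def __init__(self, dim: int, init_value: float = 1e-4):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to multiplying each token by diag(gamma).
        return self.gamma * x


class ClassAttention(nn.Module):
    """Attention in which only the class token forms queries; keys and
    values come from the class token concatenated with the patch tokens,
    so only the class token is updated."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_cls: torch.Tensor, x_patches: torch.Tensor) -> torch.Tensor:
        # x_cls: (B, 1, D) class token, x_patches: (B, N, D) patch tokens
        B, N, D = x_patches.shape
        H = self.num_heads
        z = torch.cat([x_cls, x_patches], dim=1)                     # (B, N+1, D)
        q = self.q(x_cls).reshape(B, 1, H, D // H).transpose(1, 2)   # (B, H, 1, D/H)
        k, v = self.kv(z).reshape(B, N + 1, 2, H, D // H).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale                # (B, H, 1, N+1)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, 1, D)            # updated class token
        return self.proj(out)


# Illustrative usage: a class-attention residual step, scaled by LayerScale.
x_cls = torch.randn(2, 1, 192)
x_patches = torch.randn(2, 196, 192)
ca = ClassAttention(dim=192, num_heads=4)
ls = LayerScale(dim=192)
x_cls = x_cls + ls(ca(x_cls, x_patches))  # only the class token is refined
```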

Source: Going deeper with Image Transformers

Tasks


Task                   Papers   Share
Image Classification   1        100.00%

Categories

Vision Transformers