Vision Transformers

Vision Transformer

Introduced by Dosovitskiy et al. in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

The Vision Transformer, or ViT, is a model for image classification that applies a Transformer-like architecture over patches of an image. The image is split into fixed-size patches, each of which is linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform classification, the standard approach of prepending an extra learnable "classification token" to the sequence is used, and the encoder's output for that token is passed to a classification head.

Source: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
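A minimal sketch of this forward pass in PyTorch, assuming the ViT-Base/16 shape parameters from the paper (16x16 patches, 768-dimensional embeddings, 12 layers, 12 heads); the class and variable names here are illustrative, not the authors' reference implementation:

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, num_classes=1000,
                 dim=768, depth=12, heads=12, mlp_dim=3072):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # "Split into fixed-size patches, each linearly embedded": a strided
        # convolution is equivalent to flattening each patch and applying a
        # shared linear projection.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        # Extra learnable "classification token" prepended to the sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable position embeddings, one per patch plus the class token.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth,
                                             norm=nn.LayerNorm(dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.patch_embed(x)                 # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)        # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend class token
        x = x + self.pos_embed                  # add position embeddings
        x = self.encoder(x)                     # standard Transformer encoder
        return self.head(x[:, 0])               # classify from the class token

model = SimpleViT()
logits = model(torch.randn(2, 3, 224, 224))     # -> shape (2, 1000)
```

Using a strided convolution for patch embedding avoids an explicit reshape-and-project step while computing the same shared linear map over every patch.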

Tasks


Task                        Papers   Share
Image Classification            52   5.50%
Semantic Segmentation           51   5.40%
Object Detection                36   3.81%
Self-Supervised Learning        27   2.86%
Decoder                         23   2.43%
Image Segmentation              21   2.22%
Classification                  21   2.22%
Computational Efficiency        16   1.69%
Language Modelling              14   1.48%