The Vision Transformer, or ViT, is a model for image classification that employs a Transformer-like architecture over patches of the image. An image is split into fixed-size patches, each of which is linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform classification, the standard approach of prepending an extra learnable "classification token" to the sequence is used, and the final state of that token is passed to a classification head. A minimal sketch of this pipeline is given below.
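The following is a minimal, illustrative PyTorch sketch of the steps described above (patch embedding, position embeddings, classification token, standard Transformer encoder). The hyperparameters (patch size 16, embedding dimension 768, 12 layers, etc.) are assumed ViT-Base-style defaults, not values taken from this page, and the class name `SimpleViT` is hypothetical.

```python
# Illustrative sketch only, assuming ViT-Base-like hyperparameters.
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 dim=768, depth=12, heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # Split the image into fixed-size patches and linearly embed them in
        # one step: a conv with kernel = stride = patch_size is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(in_channels, dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable classification token and position embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # Standard Transformer encoder.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # Classification head reads the final state of the [CLS] token.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.patch_embed(x)                # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend [CLS] token
        x = x + self.pos_embed                 # add position embeddings
        x = self.encoder(x)                    # standard Transformer encoder
        return self.head(x[:, 0])              # classify from [CLS] state

# Example: a batch of two 224x224 RGB images -> logits of shape (2, 1000).
logits = SimpleViT()(torch.randn(2, 3, 224, 224))
```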
Source: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
| Task | Papers | Share |
| --- | --- | --- |
| Image Classification | 52 | 5.50% |
| Semantic Segmentation | 51 | 5.40% |
| Object Detection | 36 | 3.81% |
| Self-Supervised Learning | 27 | 2.86% |
| Decoder | 23 | 2.43% |
| Image Segmentation | 21 | 2.22% |
| Classification | 21 | 2.22% |
| Computational Efficiency | 16 | 1.69% |
| Language Modelling | 14 | 1.48% |
Component types used: Attention Mechanisms, Feedforward Networks, Normalization, Attention Modules, Skip Connections