The Vision Transformer, or ViT, is a model for image classification that employs a Transformer-like architecture over patches of the image. An image is split into fixed-size patches, each of which is linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform classification, the standard approach of prepending an extra learnable "classification token" to the sequence is used, and the final state of that token is passed to a classification head. A minimal sketch of this pipeline is given below.
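The following is a minimal, illustrative PyTorch sketch of the steps described above (patch embedding, position embeddings, classification token, standard Transformer encoder). The hyperparameters (patch size 16, embedding dimension 768, 12 layers, etc.) are assumed ViT-Base-style defaults, not values taken from this page, and the class name `SimpleViT` is hypothetical.

```python
# Illustrative sketch only, assuming ViT-Base-like hyperparameters.
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 dim=768, depth=12, heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # Split the image into fixed-size patches and linearly embed them in
        # one step: a conv with kernel = stride = patch_size is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(in_channels, dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable classification token and position embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # Standard Transformer encoder.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # Classification head reads the final state of the [CLS] token.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.patch_embed(x)                # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend [CLS] token
        x = x + self.pos_embed                 # add position embeddings
        x = self.encoder(x)                    # standard Transformer encoder
        return self.head(x[:, 0])              # classify from [CLS] state

# Example: a batch of two 224x224 RGB images -> logits of shape (2, 1000).
logits = SimpleViT()(torch.randn(2, 3, 224, 224))
```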
Source: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
| Task | Papers | Share |
| --- | --- | --- |
| Image Classification | 52 | 5.50% |
| Semantic Segmentation | 51 | 5.40% |
| Object Detection | 36 | 3.81% |
| Self-Supervised Learning | 27 | 2.86% |
| Decoder | 23 | 2.43% |
| Image Segmentation | 21 | 2.22% |
| Classification | 21 | 2.22% |
| Computational Efficiency | 16 | 1.69% |
| Language Modelling | 14 | 1.48% |
Component types used: Attention Mechanisms, Feedforward Networks, Normalization, Attention Modules, Skip Connections