The Vision Transformer is a model for image classification that employs a Transformer-like architecture over patches of the image.
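To make "a Transformer-like architecture over patches of the image" concrete, here is a minimal pure-Python sketch of the input side. It splits an image into fixed-size, non-overlapping patches, flattens each one, and linearly projects it. Following the source paper, it prepends a learnable class token and adds position embeddings. The helper names (`patchify`, `linear`, `embed_patches`) and the random initialisation are illustrative assumptions; in a trained model these parameters are learned.

```python
import random

def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Patches are read in row-major order; each becomes a vector of length
    patch_size * patch_size * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append all channels of this pixel
            patches.append(flat)
    return patches

def linear(vector, weights):
    """Multiply a vector (length d_in) by a d_in x d_out weight matrix."""
    d_out = len(weights[0])
    return [sum(v * weights[i][j] for i, v in enumerate(vector)) for j in range(d_out)]

def embed_patches(image, patch_size, d_model, seed=0):
    """Hypothetical ViT-style input embedding with randomly initialised parameters.

    Returns N + 1 vectors of width d_model: a class token followed by one
    projected vector per patch, each with a position embedding added.
    """
    rng = random.Random(seed)
    patches = patchify(image, patch_size)
    d_in = len(patches[0])
    proj = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in range(d_in)]
    cls_token = [rng.gauss(0, 0.02) for _ in range(d_model)]
    pos = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in range(len(patches) + 1)]
    tokens = [cls_token] + [linear(p, proj) for p in patches]
    return [[t + e for t, e in zip(tok, pe)] for tok, pe in zip(tokens, pos)]
```

For a 224 x 224 RGB input and 16 x 16 patches, `patchify` yields 14 x 14 = 196 patches of 16 * 16 * 3 = 768 values each. The encoder therefore sees 197 tokens once the class token is included.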
Source: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

| Task | Papers | Share |
|---|---|---|
| Image Classification | 11 | 26.83% |
| Object Detection | 3 | 7.32% |
| Fine-Grained Image Classification | 3 | 7.32% |
| DeepFake Detection | 2 | 4.88% |
| Face Swapping | 2 | 4.88% |
| Depth Estimation | 2 | 4.88% |
| Monocular Depth Estimation | 2 | 4.88% |
| Face Recognition | 2 | 4.88% |
| Image Reconstruction | 1 | 2.44% |
| Component | Type |
|---|---|
| | Feedforward Networks |
| | Attention Modules |
| | Attention Mechanisms |
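The component types above (attention and feedforward networks) combine in the standard Transformer encoder block that ViT applies to the patch sequence. The paper uses a pre-norm layout: LayerNorm, multi-head self-attention, residual; then LayerNorm, a two-layer MLP with GELU, residual. The sketch below assumes several simplifications for readability. It uses a single attention head, no output projection, no biases, no learned LayerNorm scale or shift, and random weights. The function names and the `params` dictionary layout are hypothetical.

```python
import math
import random

def layer_norm(x, eps=1e-6):
    """Normalise a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(v):
    # tanh approximation of the GELU activation
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a list of token vectors."""
    q = [matvec(t, wq) for t in tokens]
    k = [matvec(t, wk) for t in tokens]
    v = [matvec(t, wv) for t in tokens]
    d_k = len(q[0])
    out = []
    for qi in q:
        # attention weights of this query over every key, scaled by sqrt(d_k)
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))])
    return out

def encoder_block(tokens, params):
    """Pre-norm block: attention sub-layer then feedforward sub-layer, each with a residual."""
    normed = [layer_norm(t) for t in tokens]
    attn = self_attention(normed, params["wq"], params["wk"], params["wv"])
    tokens = [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, attn)]
    out = []
    for t in tokens:
        hidden = [gelu(h) for h in matvec(layer_norm(t), params["w1"])]
        out.append([a + b for a, b in zip(t, matvec(hidden, params["w2"]))])
    return out

def random_params(d_model, d_hidden, seed=0):
    """Hypothetical randomly initialised weights for one encoder block."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]
    return {"wq": mat(d_model, d_model), "wk": mat(d_model, d_model),
            "wv": mat(d_model, d_model), "w1": mat(d_model, d_hidden),
            "w2": mat(d_hidden, d_model)}
```

Because both sub-layers add their output back to their input, each block preserves the sequence length and token width. This lets ViT stack many identical blocks over the patch tokens.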