An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

PDF | Abstract | ICLR 2021
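To make the abstract's core idea concrete, here is a minimal PyTorch sketch of a Vision Transformer: the image is split into fixed-size patches, each patch is linearly projected, a learnable class token and position embeddings are added, and the resulting sequence goes through a standard Transformer encoder. This is an illustrative reconstruction, not the official implementation (which is in JAX); all names and the use of `nn.TransformerEncoder` are our own choices, with defaults roughly matching the ViT-B/16 configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly project each one.
    A strided convolution is a common, equivalent way to do this."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)

class ViT(nn.Module):
    def __init__(self, num_classes=1000, dim=768, depth=12, heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(dim=dim)
        n = self.patch_embed.num_patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned 1-D position embeddings for the class token + all patches.
        self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           activation="gelu", batch_first=True,
                                           norm_first=True)   # pre-LN, as in ViT
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)                                # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed        # prepend class token
        x = self.encoder(x)
        return self.head(self.norm(x[:, 0]))                   # classify from class token

logits = ViT()(torch.randn(2, 3, 224, 224))                    # -> shape (2, 1000)
```

Note how little is vision-specific: once patches become tokens, everything downstream is the standard Transformer encoder.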

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank | Params (Rank) |
|------|---------|-------|--------|-------|-------------|---------------|
| Image Classification | CIFAR-10 | ViT-L/16 | Percentage correct | 99.42 | #3 | 307M (#148) |
| Image Classification | CIFAR-10 | ViT-H/14 | Percentage correct | 99.5 | #2 | 632M (#149) |
| Image Classification | CIFAR-100 | ViT-H/14 | Percentage correct | 94.55 | #2 | - |
| Image Classification | CIFAR-100 | ViT-L/16 | Percentage correct | 93.9 | #4 | - |
| Image Classification | Flowers-102 | ViT-L/16 | Accuracy | 99.74 | #2 | - |
| Image Classification | Flowers-102 | ViT-H/14 | Accuracy | 99.68 | #4 | - |
| Image Classification | ImageNet | ViT-H/14 | Top-1 Accuracy | 88.55% | #6 | 632M (#4) |
| Image Classification | ImageNet | ViT-L/16 | Top-1 Accuracy | 87.76% | #9 | 307M (#18) |
| Image Classification | ImageNet ReaL | ViT-H/14 | Accuracy | 90.72% | #4 | 632M (#31) |
| Image Classification | ImageNet ReaL | ViT-L/16 | Accuracy | 90.54% | #7 | 307M (#28) |
| Fine-Grained Image Classification | Oxford 102 Flowers | ViT-H/14 | Accuracy | 99.68% | #2 | 632M (#17) |
| Fine-Grained Image Classification | Oxford 102 Flowers | ViT-L/16 | Accuracy | 99.74% | #1 | 307M (#16) |
| Fine-Grained Image Classification | Oxford-IIIT Pets | ViT-B/16 | Top-1 Error Rate | 6.2% | #8 | 86.4M (#14) |
| Fine-Grained Image Classification | Oxford-IIIT Pets | ViT-B/16 | Accuracy | 93.8% | #13 | 86.4M (#14) |
| Fine-Grained Image Classification | Oxford-IIIT Pets | ViT-L/16 | Accuracy | 97.32% | #2 | 307M (#15) |
| Fine-Grained Image Classification | Oxford-IIIT Pets | ViT-H/14 | Accuracy | 97.56% | #1 | 632M (#16) |
| Image Classification | VTAB-1k | ViT-L/16 | Top-1 Accuracy | 76.28 | #6 | 307M (#32) |
| Image Classification | VTAB-1k | ViT-L/16 (ImageNet-21k) | Top-1 Accuracy | 72.72 | #7 | - |
| Image Classification | VTAB-1k | ViT-H/14 | Top-1 Accuracy | 77.63 | #3 | 632M (#33) |

All entries use extra training data: the models are pre-trained on the large JFT-300M dataset (or on ImageNet-21k where noted) before being fine-tuned on the target benchmark.

Methods used in the Paper


| Method | Type |
|--------|------|
| FixRes | Image Scaling Strategies |
| Vision Transformer | Image Models |
| GELU | Activation Functions |
| BPE | Subword Segmentation |
| Softmax | Output Functions |
| Adam | Stochastic Optimization |
| Layer Normalization | Normalization |
| Dense Connections | Feedforward Networks |
| Multi-Head Attention | Attention Modules |
| Label Smoothing | Regularization |
| Dropout | Regularization |
| Residual Connection | Skip Connections |
| Scaled Dot-Product Attention | Attention Mechanisms |
| Transformer | Transformers |
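Several of the listed methods (Multi-Head Attention, GELU, Layer Normalization, Residual Connections, Dense Connections, Dropout) combine into the single pre-norm encoder block that ViT stacks depth-many times. The sketch below is written from the paper's description rather than the released code, so the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block: LayerNorm -> multi-head
    self-attention -> residual, then LayerNorm -> MLP -> residual."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4, p_drop=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                     # Layer Normalization
        self.attn = nn.MultiheadAttention(                 # Multi-Head (scaled dot-product) Attention
            dim, heads, dropout=p_drop, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                          # Dense Connections / feedforward network
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),                                     # GELU activation
            nn.Dropout(p_drop),                            # Dropout regularization
            nn.Linear(mlp_ratio * dim, dim),
            nn.Dropout(p_drop),
        )

    def forward(self, x):                                  # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual (skip) connection
        return x + self.mlp(self.norm2(x))                 # second residual connection

x = torch.randn(2, 197, 768)       # 196 patch tokens + 1 class token
print(EncoderBlock()(x).shape)     # torch.Size([2, 197, 768])
```

The remaining listed methods sit outside this block: Adam is the optimizer, Label Smoothing regularizes the softmax classification loss, FixRes refers to fine-tuning at a higher resolution than pre-training, and BPE carries over from the original NLP Transformer rather than from ViT itself.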