Vision Transformer

Last updated on Feb 14, 2021

vit_base_patch16_224

| Field | Value |
| --- | --- |
| Parameters | 87 Million |
| FLOPs | 67 Billion |
| File Size | 330.25 MB |
| Training Data | JFT-300M, ImageNet |
| Training Resources | TPUv3 |
| Training Techniques | SGD with Momentum, Cosine Annealing, Gradient Clipping |
| Architecture | Layer Normalization, Multi-Head Attention, Tanh Activation, Dense Connections, Attention Dropout, Dropout, Scaled Dot-Product Attention, GELU, Convolution |
| ID | vit_base_patch16_224 |
| LR | 0.0008 |
| Epochs | 90 |
| Dropout | 0.0 |
| Crop Pct | 0.9 |
| Batch Size | 4096 |
| Image Size | 224 |
| Warmup Steps | 10000 |
| Weight Decay | 0.03 |
| Interpolation | bicubic |
vit_base_patch16_384

| Field | Value |
| --- | --- |
| Parameters | 87 Million |
| FLOPs | 49 Billion |
| File Size | 331.36 MB |
| Training Data | JFT-300M, ImageNet |
| Training Resources | TPUv3 |
| Training Techniques | SGD with Momentum, Cosine Annealing, Gradient Clipping |
| Architecture | Layer Normalization, Multi-Head Attention, Tanh Activation, Dense Connections, Attention Dropout, Dropout, Scaled Dot-Product Attention, GELU, Convolution |
| ID | vit_base_patch16_384 |
| Crop Pct | 1.0 |
| Momentum | 0.9 |
| Batch Size | 512 |
| Image Size | 384 |
| Weight Decay | 0.0 |
| Interpolation | bicubic |
vit_base_patch32_384

| Field | Value |
| --- | --- |
| Parameters | 88 Million |
| FLOPs | 13 Billion |
| File Size | 336.85 MB |
| Training Data | JFT-300M, ImageNet |
| Training Resources | TPUv3 |
| Training Techniques | SGD with Momentum, Cosine Annealing, Gradient Clipping |
| Architecture | Layer Normalization, Multi-Head Attention, Tanh Activation, Dense Connections, Attention Dropout, Dropout, Scaled Dot-Product Attention, GELU, Convolution |
| ID | vit_base_patch32_384 |
| Crop Pct | 1.0 |
| Momentum | 0.9 |
| Batch Size | 512 |
| Image Size | 384 |
| Weight Decay | 0.0 |
| Interpolation | bicubic |
vit_base_resnet50_384

| Field | Value |
| --- | --- |
| Parameters | 99 Million |
| FLOPs | 49 Billion |
| File Size | 377.52 MB |
| Training Data | JFT-300M, ImageNet |
| Training Resources | TPUv3 |
| Training Techniques | SGD with Momentum, Cosine Annealing, Gradient Clipping |
| Architecture | Layer Normalization, Multi-Head Attention, Tanh Activation, Dense Connections, Attention Dropout, Dropout, Scaled Dot-Product Attention, GELU, Convolution |
| ID | vit_base_resnet50_384 |
| Crop Pct | 1.0 |
| Momentum | 0.9 |
| Batch Size | 512 |
| Image Size | 384 |
| Weight Decay | 0.0 |
| Interpolation | bicubic |
vit_large_patch16_224

| Field | Value |
| --- | --- |
| Parameters | 304 Million |
| FLOPs | 119 Billion |
| File Size | 1.16 GB |
| Training Data | JFT-300M, ImageNet |
| Training Resources | TPUv3 |
| Training Techniques | SGD with Momentum, Cosine Annealing, Gradient Clipping |
| Architecture | Layer Normalization, Multi-Head Attention, Tanh Activation, Dense Connections, Attention Dropout, Dropout, Scaled Dot-Product Attention, GELU, Convolution |
| ID | vit_large_patch16_224 |
| Crop Pct | 0.9 |
| Momentum | 0.9 |
| Batch Size | 512 |
| Image Size | 224 |
| Weight Decay | 0.0 |
| Interpolation | bicubic |
vit_large_patch16_384

| Field | Value |
| --- | --- |
| Parameters | 305 Million |
| FLOPs | 175 Billion |
| File Size | 1.16 GB |
| Training Data | JFT-300M, ImageNet |
| Training Resources | TPUv3 |
| Training Techniques | SGD with Momentum, Cosine Annealing, Gradient Clipping |
| Architecture | Layer Normalization, Multi-Head Attention, Tanh Activation, Dense Connections, Attention Dropout, Dropout, Scaled Dot-Product Attention, GELU, Convolution |
| ID | vit_large_patch16_384 |
| Crop Pct | 1.0 |
| Momentum | 0.9 |
| Batch Size | 512 |
| Image Size | 384 |
| Weight Decay | 0.0 |
| Interpolation | bicubic |
vit_small_patch16_224

| Field | Value |
| --- | --- |
| Parameters | 49 Million |
| FLOPs | 28 Billion |
| File Size | 186.00 MB |
| Training Data | JFT-300M, ImageNet |
| Training Resources | TPUv3 |
| Training Techniques | SGD with Momentum, Cosine Annealing, Gradient Clipping |
| Architecture | Layer Normalization, Multi-Head Attention, Tanh Activation, Dense Connections, Attention Dropout, Dropout, Scaled Dot-Product Attention, GELU, Convolution |
| ID | vit_small_patch16_224 |
| Crop Pct | 0.9 |
| Image Size | 224 |
| Interpolation | bicubic |

Summary

The Vision Transformer (ViT) is an image classification model that applies a Transformer-style architecture to patches of the input image: the image is split into fixed-size patches, each patch is embedded as a token, and the resulting sequence is processed with Multi-Head Attention, Scaled Dot-Product Attention, and the other components of the Transformer architecture traditionally used for NLP.
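To make the patch-based design concrete, here is a minimal sketch in PyTorch (illustrative only, not timm's internal code) of how a 224×224 image becomes a token sequence under the vit_base_patch16_224 configuration; the shapes (16×16 patches, 768-dim embeddings, 196 patch tokens plus one class token) follow the paper:

```python
import torch
import torch.nn as nn

# A 224x224 image is split into 14x14 = 196 non-overlapping 16x16 patches,
# each linearly projected to a 768-dim token. The projection is implemented
# as a strided convolution, which is why "Convolution" appears in the
# architecture lists above.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)           # one RGB image
tokens = patch_embed(img)                   # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): one token per patch

# A learnable class token is prepended and position embeddings are added
# before the sequence enters the Transformer encoder.
cls_token = nn.Parameter(torch.zeros(1, 1, 768))
pos_embed = nn.Parameter(torch.zeros(1, 197, 768))
x = torch.cat([cls_token, tokens], dim=1) + pos_embed
print(x.shape)  # torch.Size([1, 197, 768])
```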

How do I load this model?

To load a pretrained model:

```python
import timm

# Create a pretrained Vision Transformer and switch to inference mode
m = timm.create_model('vit_large_patch16_224', pretrained=True)
m.eval()
```

Replace the model name with the variant you want to use, e.g. `vit_large_patch16_224`. You can find the IDs in the model summaries at the top of this page.
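Once loaded, the model can be applied to an image. The sketch below uses timm's data utilities (`resolve_data_config`, `create_transform`) to build preprocessing that matches the model card above (image size, crop pct, bicubic interpolation, normalization); the file `dog.jpg` is a hypothetical placeholder:

```python
import torch
from PIL import Image
import timm
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

m = timm.create_model('vit_base_patch16_224', pretrained=True)
m.eval()

# Derive the eval preprocessing pipeline from the model's own
# pretrained config, matching the fields in the model card.
config = resolve_data_config({}, model=m)
transform = create_transform(**config)

img = Image.open('dog.jpg').convert('RGB')  # 'dog.jpg' is a placeholder
x = transform(img).unsqueeze(0)             # (1, 3, 224, 224)

with torch.no_grad():
    probs = m(x).softmax(dim=-1)
print(probs.topk(5))                        # top-5 class indices and scores
```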

How do I train this model?

You can follow the timm recipe scripts for training a new model from scratch.
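For orientation, here is a minimal, self-contained sketch (not timm's `train.py`) of how the training techniques listed in the model cards, SGD with momentum, cosine annealing, and gradient clipping, fit together; the learning rate and weight decay mirror the vit_base_patch16_224 card, while the clipping norm, class count, and random tensors are illustrative assumptions:

```python
import timm
import torch
import torch.nn.functional as F

# SGD with momentum + weight decay, per the model card above.
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.0008,
                            momentum=0.9, weight_decay=0.03)
# Cosine annealing over the run (the card lists 90 epochs).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)

images = torch.randn(8, 3, 224, 224)  # stand-in batch; the card used 4096
labels = torch.randint(0, 10, (8,))

model.train()
for epoch in range(2):                # shortened from 90 for the sketch
    loss = F.cross_entropy(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping, the third listed technique (max_norm is assumed).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    print(f'epoch {epoch}: loss {loss.item():.4f}')
```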

Citation

@misc{dosovitskiy2020image,
      title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale}, 
      author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
      year={2020},
      eprint={2010.11929},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Results

Image Classification on ImageNet

| Model | Top-1 Accuracy | Top-5 Accuracy |
| --- | --- | --- |
| vit_large_patch16_384 | 85.17% | 97.36% |
| vit_base_resnet50_384 | 84.99% | 97.3% |
| vit_base_patch16_384 | 84.2% | 97.22% |
| vit_large_patch16_224 | 83.06% | 96.44% |
| vit_base_patch16_224 | 81.78% | 96.13% |
| vit_base_patch32_384 | 81.66% | 96.13% |
| vit_small_patch16_224 | 77.85% | 93.42% |