TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic Segmentation	ADE20K val	DeiT-L	mIoU	55.6	# 26
Semantic Segmentation	ADE20K val	DeiT-B	mIoU	54.1	# 33
Image Classification	ImageNet	ViT-S @224 (DeiT III, 21k)	Top 1 Accuracy	83.1%	# 426
Image Classification	ImageNet	ViT-S @384 (DeiT III)	Top 1 Accuracy	83.4%	# 394
Image Classification	ImageNet	ViT-S @384 (DeiT III)	Number of params	22M	# 557
Image Classification	ImageNet	ViT-S @384 (DeiT III)	GFLOPs	15.5	# 341
Image Classification	ImageNet	ViT-L @224 (DeiT III)	Top 1 Accuracy	84.9%	# 265
Image Classification	ImageNet	ViT-B @224 (DeiT III, 21k)	Top 1 Accuracy	85.7%	# 200
Image Classification	ImageNet	ViT-B @384 (DeiT III, 21k)	Top 1 Accuracy	86.7%	# 126
Image Classification	ImageNet	ViT-H @224 (DeiT III)	Top 1 Accuracy	85.2%	# 239
Image Classification	ImageNet	ViT-B @384 (DeiT III)	Top 1 Accuracy	85.0%	# 255
Image Classification	ImageNet	ViT-B @384 (DeiT III)	Number of params	87M	# 822
Image Classification	ImageNet	ViT-B @224 (DeiT III)	Top 1 Accuracy	83.8%	# 358
Image Classification	ImageNet	ViT-L	Top 1 Accuracy	85.8%	# 187
Image Classification	ImageNet	ViT-L	Number of params	304.8M	# 914
Image Classification	ImageNet	ViT-L	GFLOPs	191.2	# 468
Image Classification	ImageNet	ViT-S @224 (DeiT III)	Top 1 Accuracy	81.4%	# 586
Image Classification	ImageNet ReaL	ViT-H @224 (DeiT III, 21k)	Top 1 Accuracy	87.2%	# 3
Image Classification	ImageNet ReaL	ViT-H @224 (DeiT III, 21k)	Number of params	632M	# 1
Image Classification	ImageNet ReaL	ViT-L @224 (DeiT III, 21k)	Top 1 Accuracy	87.0%	# 4
Image Classification	ImageNet ReaL	ViT-L @384 (DeiT III, 21k)	Top 1 Accuracy	87.7%	# 1
Image Classification	ImageNet ReaL	ViT-L @384 (DeiT III, 21k)	Number of params	304M	# 2
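
The rows above can be compared directly to see how much each training option adds for a given model size. The following is a minimal sketch that copies a few ImageNet Top 1 Accuracy values from the table and computes the difference against the plain @224 entry. It assumes that "21k" in a model label denotes ImageNet-21k pre-training and that "@224" / "@384" denote the input resolution; the names `TOP1`, `gain` and `summarize` are illustrative and not part of any listed repository.

```python
# ImageNet Top 1 Accuracy (%), copied from the table above.
TOP1 = {
    "ViT-S @224 (DeiT III)": 81.4,
    "ViT-S @224 (DeiT III, 21k)": 83.1,
    "ViT-S @384 (DeiT III)": 83.4,
    "ViT-B @224 (DeiT III)": 83.8,
    "ViT-B @224 (DeiT III, 21k)": 85.7,
    "ViT-B @384 (DeiT III)": 85.0,
    "ViT-B @384 (DeiT III, 21k)": 86.7,
}


def gain(base: str, variant: str) -> float:
    """Accuracy difference in percentage points, rounded to one decimal."""
    return round(TOP1[variant] - TOP1[base], 1)


def summarize(size: str) -> dict:
    """Gains over the plain @224 model for one model size ('S' or 'B')."""
    base = f"ViT-{size} @224 (DeiT III)"
    return {
        # Assumption: the "21k" label marks ImageNet-21k pre-training.
        "21k pre-training": gain(base, f"ViT-{size} @224 (DeiT III, 21k)"),
        # Assumption: "@384" marks evaluation at 384x384 resolution.
        "384 resolution": gain(base, f"ViT-{size} @384 (DeiT III)"),
    }


if __name__ == "__main__":
    for size in ("S", "B"):
        print(size, summarize(size))
```

These differences only restate the listed numbers; they say nothing about cost, since the table reports GFLOPs and parameter counts for only a few entries.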

DeiT III: Revenge of the ViT

14 Apr 2022 · Hugo Touvron, Matthieu Cord, Hervé Jégou

A Vision Transformer (ViT) is a simple neural architecture that can serve several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors about either the input data or specific tasks. Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT. In this paper, we revisit the supervised training of ViTs. Our procedure builds upon and simplifies a recipe introduced for training ResNet-50. It includes a new, simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning. Our evaluations on image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning and semantic segmentation show that our procedure outperforms previous fully supervised training recipes for ViT by a large margin. They also reveal that the performance of our ViT trained with supervision is comparable to that of more recent architectures. Our results could serve as better baselines for recent self-supervised approaches demonstrated on ViT.
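
The abstract does not name the three augmentations. The sketch below assumes, following the full paper's description of 3-Augment, that they are grayscale, solarization and Gaussian blur, and that exactly one is chosen uniformly at random per image. It is a dependency-free illustration on tiny RGB images stored as nested lists, not the authors' implementation: the 3x3 binomial kernel stands in for a real Gaussian blur, the solarization threshold of 128 is an assumed default, and the color jitter, horizontal flip and cropping that accompany these choices in the recipe are left out.

```python
import random
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each channel in 0..255
Image = List[List[Pixel]]      # rows of pixels


def grayscale(img: Image) -> Image:
    """Replace each pixel by its luma, replicated over three channels."""
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((y, y, y))
        out.append(new_row)
    return out


def solarize(img: Image, threshold: int = 128) -> Image:
    """Invert every channel value at or above the (assumed) threshold."""
    def flip(v: int) -> int:
        return v if v < threshold else 255 - v
    return [[(flip(r), flip(g), flip(b)) for r, g, b in row] for row in img]


def gaussian_blur(img: Image) -> Image:
    """3x3 binomial blur (weights 1-2-1), with edge pixels clamped."""
    h, w = len(img), len(img[0])
    weights = (1, 2, 1)
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = [0, 0, 0]
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    k = weights[dy + 1] * weights[dx + 1]
                    for c in range(3):
                        acc[c] += k * img[yy][xx][c]
            row.append(tuple(round(v / 16) for v in acc))
        out.append(row)
    return out


AUGMENTATIONS = (grayscale, solarize, gaussian_blur)


def three_augment(img: Image, rng=random) -> Image:
    """Apply exactly one of the three augmentations, chosen uniformly."""
    return rng.choice(AUGMENTATIONS)(img)


if __name__ == "__main__":
    image = [[(200, 30, 90), (10, 240, 128)], [(0, 0, 0), (255, 255, 255)]]
    print(three_augment(image, random.Random(0)))
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible. In a real training pipeline the same selection logic would normally wrap library transforms rather than hand-written pixel loops.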

Code

facebookresearch/deit official

3,856

rwightman/pytorch-image-models

29,671

open-mmlab/mmclassification

3,137

alibaba/EasyCV

1,671

affjljoo3581/deit3-jax

See all 9 implementations

Tasks

Data Augmentation

Image Classification

Self-Supervised Learning

Semantic Segmentation

Transfer Learning

Datasets

CIFAR-10

ImageNet

CIFAR-100

Oxford 102 Flower

ADE20K

ImageNet-1K

Results from the Paper

Ranked #1 on Image Classification on ImageNet ReaL (Number of params metric)

Methods

3-Augment • Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • FixRes • Label Smoothing • Layer Normalization • LayerScale • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
