TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	CIFAR-10	ViT-L/16	Percentage correct	99.42	# 4
Image Classification	CIFAR-10	ViT-H/14	Percentage correct	99.5	# 1
Image Classification	CIFAR-10	ViT-H/14	PARAMS	632M	# 239
Image Classification	CIFAR-10	ViT-H/14	Top-1 Accuracy	99.5	# 1
Image Classification	ImageNet	ViT-L/16	Top 1 Accuracy	87.76%	# 81
Image Classification	ImageNet	ViT-H/14	Top 1 Accuracy	88.55%	# 47
Out-of-Distribution Generalization	ImageNet-W	ViT-B/32	Carton Gap	+34	# 1
Out-of-Distribution Generalization	ImageNet-W	ViT-B/16	Carton Gap	+26	# 1
Out-of-Distribution Generalization	ImageNet-W	ViT-L/16	Carton Gap	+34	# 1
Dynamic Facial Expression Recognition	MAFW	ViT	WAR	45.04	# 10
Image Classification	ObjectNet	ViT-H/14	Top-5 Accuracy	82.1	# 1
Fine-Grained Image Classification	Oxford-IIIT Pets	ViT-B/16	Top-1 Error Rate	6.2%	# 5
Domain Generalization	VizWiz-Classification	ViT-16/L-224	Accuracy - All Images	49	# 8
Domain Generalization	VizWiz-Classification	ViT-8/B-224	Accuracy - Clean Images	48.9	# 15

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-is-worth-16x16-words-transformers-1/image-classification-on-cifar-10)](https://paperswithcode.com/sota/image-classification-on-cifar-10?p=an-image-is-worth-16x16-words-transformers-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-is-worth-16x16-words-transformers-1/out-of-distribution-generalization-on-1)](https://paperswithcode.com/sota/out-of-distribution-generalization-on-1?p=an-image-is-worth-16x16-words-transformers-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-is-worth-16x16-words-transformers-1/image-classification-on-objectnet)](https://paperswithcode.com/sota/image-classification-on-objectnet?p=an-image-is-worth-16x16-words-transformers-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-is-worth-16x16-words-transformers-1/fine-grained-image-classification-on-oxford-2)](https://paperswithcode.com/sota/fine-grained-image-classification-on-oxford-2?p=an-image-is-worth-16x16-words-transformers-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-is-worth-16x16-words-transformers-1/domain-generalization-on-vizwiz)](https://paperswithcode.com/sota/domain-generalization-on-vizwiz?p=an-image-is-worth-16x16-words-transformers-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-is-worth-16x16-words-transformers-1/dynamic-facial-expression-recognition-on-mafw)](https://paperswithcode.com/sota/dynamic-facial-expression-recognition-on-mafw?p=an-image-is-worth-16x16-words-transformers-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-is-worth-16x16-words-transformers-1/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=an-image-is-worth-16x16-words-transformers-1)`

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

ICLR 2021 · Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
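The phrase "sequences of image patches" can be made concrete with a small sketch. The NumPy snippet below is illustrative only and is not taken from the official repository: the function name `image_to_patches`, the 224×224 input resolution, the embedding width of 64, and the random projection matrix are all assumptions chosen for the example. It splits an image into non-overlapping 16×16 patches, flattens each one, and applies a linear projection so that the result resembles the token sequence a Transformer encoder would consume.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened, non-overlapping patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (H, W, C) -> (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows*cols, P*P*C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # assumed input resolution, stand-in for a real image

tokens = image_to_patches(image, patch_size=16)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values

# Hypothetical linear projection to a small embedding width (random placeholder weights,
# where a trained model would use learned ones).
embed_dim = 64
projection = rng.normal(size=(tokens.shape[1], embed_dim))
embeddings = tokens @ projection
print(embeddings.shape)  # (196, 64)
```

Under these assumptions, a smaller patch size yields a longer sequence: at the same 224×224 resolution, 14×14 patches would give 256 tokens and 32×32 patches would give 49.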


Code


Repository	Stars	Notes
google-research/vision_transformer	9,208	Official; Quickstart in Colab
huggingface/transformers	124,457	
labmlai/annotated_deep_learning_pap…	47,336	Annotated code at labml.ai
rwightman/pytorch-image-models	29,648	
lucidrains/vit-pytorch	17,857	

See all 143 implementations

Tasks


Classification

Document Image Classification

Domain Generalization

Dynamic Facial Expression Recognition

Fine-Grained Image Classification

Image Classification

Medical Image Segmentation

Out-of-Distribution Generalization

Semantic Segmentation

Datasets

CIFAR-10

ImageNet

CIFAR-100

Oxford 102 Flower

ObjectNet

JFT-300M

Oxford-IIIT Pets

OmniBenchmark

ImageNet-W

VizWiz-Classification

MAFW

Results from the Paper


Ranked #1 on Image Classification on CIFAR-10



Methods


Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • FixRes • GELU • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
