TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	ImageNet	ViT-G/14	Top 1 Accuracy	90.45%	# 10
Image Classification	ImageNet	ViT-G/14	Number of params	1843M	# 962
Image Classification	ImageNet	ViT-G/14	Hardware Burden	None	# 1
Image Classification	ImageNet	ViT-G/14	Operations per network pass	None	# 1
Image Classification	ImageNet	ViT-G/14	GFLOPs	2859.9	# 493
Image Classification	ImageNet ReaL	ViT-G/14	Accuracy	90.81%	# 11
Image Classification	ImageNet V2	ViT-G/14	Top 1 Accuracy	83.33	# 6
Image Classification	ObjectNet	ViT-G/14	Top-1 Accuracy	70.53	# 14
Image Classification	ObjectNet	NS (Eff.-L2)	Top-1 Accuracy	68.5	# 16
Image Classification	VTAB-1k	ViT-G/14	Top-1 Accuracy	78.29	# 3

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scaling-vision-transformers/image-classification-on-vtab-1k-1)](https://paperswithcode.com/sota/image-classification-on-vtab-1k-1?p=scaling-vision-transformers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scaling-vision-transformers/image-classification-on-imagenet-v2)](https://paperswithcode.com/sota/image-classification-on-imagenet-v2?p=scaling-vision-transformers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scaling-vision-transformers/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=scaling-vision-transformers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scaling-vision-transformers/image-classification-on-imagenet-real)](https://paperswithcode.com/sota/image-classification-on-imagenet-real?p=scaling-vision-transformers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scaling-vision-transformers/image-classification-on-objectnet)](https://paperswithcode.com/sota/image-classification-on-objectnet?p=scaling-vision-transformers)`

Scaling Vision Transformers

CVPR 2022 · Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer

Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results; understanding a model's scaling properties is therefore key to designing future generations effectively. While the laws for scaling Transformer language models have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up and down, and characterize the relationships between error rate, data, and compute. Along the way, we refine the architecture and training of ViT, reducing memory consumption and increasing the accuracy of the resulting models. As a result, we successfully train a ViT model with two billion parameters, which attains a new state of the art on ImageNet of 90.45% top-1 accuracy. The model also performs well for few-shot transfer, for example reaching 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
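The relationship between error rate and compute mentioned in the abstract is often summarized as a power law. As a minimal sketch, assuming the simplest form E = a * C^(-b) (this page does not state the functional form the paper uses), the exponent can be estimated with a least-squares line fit in log-log space. The function name `fit_power_law` is hypothetical, and the data points are synthetic, generated from known coefficients rather than taken from the paper.

```python
import math


def fit_power_law(compute, error):
    """Fit error = a * compute**(-b) by least squares in log-log space.

    Returns (a, b). Assumes every input value is strictly positive.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(e) for e in error]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares for the line log(E) = log(a) - b * log(C).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic illustration only: points generated from a = 2.0, b = 0.35.
# These are NOT measurements from the paper.
TRUE_A, TRUE_B = 2.0, 0.35
compute = [10.0 ** k for k in range(1, 6)]
error = [TRUE_A * c ** (-TRUE_B) for c in compute]

a_hat, b_hat = fit_power_law(compute, error)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

A pure power law is a straight line in log-log space, so it cannot represent error that flattens out at very low or very high compute. Modeling that kind of saturation would need extra offset terms and a nonlinear fit.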

Code

google-research/big_vision (official) · 1,552 stars

Tasks

Few-Shot Image Classification

Few-Shot Learning

Image Classification

Datasets

Introduced in the Paper:

JFT-3B

Used in the Paper:

ImageNet

ObjectNet

JFT-300M

Results from the Paper

Ranked #3 on Image Classification on VTAB-1k (using extra training data)

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
