TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	CIFAR-10	DVT (T2T-ViT-24)	Percentage correct	98.53	# 33
Image Classification	CIFAR-100	DVT (T2T-ViT-24)	Percentage correct	89.63	# 29
Image Classification	ImageNet	DVT (T2T-ViT-7)	Top 1 Accuracy	78.48%	# 764
Image Classification	ImageNet	DVT (T2T-ViT-7)	GFLOPs	0.6	# 65
Image Classification	ImageNet	DVT (T2T-ViT-10)	Top 1 Accuracy	79.74%	# 684
Image Classification	ImageNet	DVT (T2T-ViT-10)	GFLOPs	0.7	# 83
Image Classification	ImageNet	DVT (T2T-ViT-12)	Top 1 Accuracy	80.43%	# 644
Image Classification	ImageNet	DVT (T2T-ViT-12)	Hardware Burden	None	# 1
Image Classification	ImageNet	DVT (T2T-ViT-12)	Operations per network pass	None	# 1
Image Classification	ImageNet	DVT (T2T-ViT-12)	GFLOPs	1.7	# 135

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/not-all-images-are-worth-16x16-words-dynamic/image-classification-on-cifar-100)](https://paperswithcode.com/sota/image-classification-on-cifar-100?p=not-all-images-are-worth-16x16-words-dynamic)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/not-all-images-are-worth-16x16-words-dynamic/image-classification-on-cifar-10)](https://paperswithcode.com/sota/image-classification-on-cifar-10?p=not-all-images-are-worth-16x16-words-dynamic)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/not-all-images-are-worth-16x16-words-dynamic/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=not-all-images-are-worth-16x16-words-dynamic)`

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

NeurIPS 2021 · Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang

Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens leads to higher prediction accuracy, but it also drastically increases computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16 or 14x14. In this paper, we argue that every image has its own characteristics, and ideally the token number should be conditioned on each individual input. In fact, we have observed that a considerable number of "easy" images can be accurately predicted with as few as 4x4 tokens, while only a small fraction of "hard" ones need a finer representation. Inspired by this phenomenon, we propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image. This is achieved by cascading multiple Transformers with increasing numbers of tokens, which are sequentially activated in an adaptive fashion at test time, i.e., inference is terminated once a sufficiently confident prediction is produced. We further design efficient feature reuse and relationship reuse mechanisms across different components of the Dynamic Transformer to reduce redundant computations. Extensive empirical results on ImageNet, CIFAR-10, and CIFAR-100 demonstrate that our method significantly outperforms the competitive baselines in terms of both theoretical computational efficiency and practical inference speed. Code and pre-trained models (based on PyTorch and MindSpore) are available at https://github.com/blackfeather-wang/Dynamic-Vision-Transformer and https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore.
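The adaptive inference described in the abstract (run the coarsest model first, stop as soon as the prediction is confident enough) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cascade_predict`, the toy stages, and the threshold value are all hypothetical, the confidence measure is assumed to be the maximum softmax probability, and the feature reuse and relationship reuse mechanisms are omitted entirely.

```python
import math
from typing import Callable, List, Sequence, Tuple

Logits = Sequence[float]
Stage = Callable[[object], Logits]  # hypothetical: maps an image to class logits


def softmax(logits: Logits) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(
    image: object,
    stages: Sequence[Stage],
    thresholds: Sequence[float],
) -> Tuple[int, float, int]:
    """Run stages from coarsest to finest and exit early when confident.

    `stages` are ordered by increasing token count (e.g. 4x4, 7x7, 14x14).
    `thresholds[i]` is the confidence needed to stop at stage i; the last
    stage always produces the final answer, so it has no threshold.
    Returns (predicted class, confidence, index of the exit stage).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    if len(thresholds) != len(stages) - 1:
        raise ValueError("need one threshold per stage except the last")

    last = len(stages) - 1
    for i, stage in enumerate(stages):
        probs = softmax(stage(image))
        confidence = max(probs)
        label = probs.index(confidence)
        if i == last or confidence >= thresholds[i]:
            return label, confidence, i
    raise AssertionError("unreachable: the last stage always returns")


# Toy stand-ins for real models (assumptions for illustration only).
def coarse_stage(image: object) -> Logits:
    # Confident on "easy" inputs, nearly uniform on everything else.
    return [2.0, 0.0, 0.0] if image == "easy" else [0.1, 0.0, 0.0]


def fine_stage(image: object) -> Logits:
    return [0.0, 3.0, 0.0]


easy_result = cascade_predict("easy", [coarse_stage, fine_stage], [0.7])
hard_result = cascade_predict("hard", [coarse_stage, fine_stage], [0.7])
```

In this toy setup the "easy" input exits at the coarse stage and the "hard" input falls through to the fine stage. Raising or lowering the threshold trades accuracy against average compute, since a lower threshold lets more inputs stop at the cheap stage.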

Code

blackfeather-wang/Dynamic-Vision-Tr… official

241

blackfeather-wang/dynamic-vision-tr… official

Tasks

Computational Efficiency

Image Classification

Datasets

CIFAR-10

ImageNet

CIFAR-100

Results from the Paper

Ranked #29 on Image Classification on CIFAR-100 (using extra training data)

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
