DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Attention is sparse in vision transformers. We observe that the final prediction in vision transformers is based only on a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module that estimates an importance score for each token given the current features. This module is added at several layers to prune redundant tokens hierarchically. To optimize the prediction module end to end, we propose an attention masking strategy that prunes a token differentiably by blocking its interactions with all other tokens. Thanks to the nature of self-attention, the unstructured sparse tokens remain hardware-friendly, so our framework achieves actual speed-ups in practice. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%–37% and improves throughput by over 40%, while the drop in accuracy stays within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models achieve highly competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT
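The two ingredients described above, a lightweight per-token score predictor and an attention mask that blocks pruned tokens, can be sketched as follows. This is a minimal NumPy illustration with hypothetical shapes and weight names, not the paper's implementation (which uses a Gumbel-softmax to sample keep decisions so the predictor can be trained end to end):

```python
import numpy as np

def token_scores(x, w1, w2):
    # Lightweight prediction head: per-token keep scores from current features.
    h = np.maximum(x @ w1, 0.0)            # one ReLU MLP layer
    return (h @ w2).squeeze(-1)            # shape (N,): one score per token

def masked_attention(q, k, v, keep_mask):
    # Attention masking: columns of pruned tokens (mask == 0) get -inf logits,
    # so kept tokens cannot attend to them; this blocks their influence
    # without physically removing them, which keeps the op differentiable.
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                          # (N, N)
    logits = np.where(keep_mask[None, :] > 0, logits, -1e9)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
N, D = 8, 16                               # toy token count and feature dim
x = rng.standard_normal((N, D))
w1 = rng.standard_normal((D, D))
w2 = rng.standard_normal((D, 1))

scores = token_scores(x, w1, w2)
keep = int(0.7 * N)                        # e.g. a 70% keep ratio per stage
keep_mask = np.zeros(N)
keep_mask[np.argsort(scores)[-keep:]] = 1.0   # keep the top-scoring tokens

out = masked_attention(x, x, x, keep_mask)    # attention over kept tokens only
```

At inference time the mask is not needed: the pruned tokens can simply be dropped, shrinking the sequence length at each sparsification stage, which is where the FLOP and throughput savings come from.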

PDF Abstract · NeurIPS 2021

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | ImageNet | DynamicViT-LV-M/0.8 | Top 1 Accuracy | 83.9 | #347 |
| Image Classification | ImageNet | DynamicViT-LV-M/0.8 | Number of params | 57.1M | #759 |
| Image Classification | ImageNet | DynamicViT-LV-M/0.8 | Hardware Burden | None | #1 |
| Image Classification | ImageNet | DynamicViT-LV-M/0.8 | Operations per network pass | None | #1 |
| Efficient ViTs | ImageNet-1K (with DeiT-S) | DynamicViT (70%) | Top 1 Accuracy | 79.3 | #25 |
| Efficient ViTs | ImageNet-1K (with DeiT-S) | DynamicViT (70%) | GFLOPs | 2.9 | #19 |
| Efficient ViTs | ImageNet-1K (with DeiT-S) | DynamicViT (80%) | Top 1 Accuracy | 79.8 | #4 |
| Efficient ViTs | ImageNet-1K (with DeiT-S) | DynamicViT (80%) | GFLOPs | 3.4 | #33 |
| Efficient ViTs | ImageNet-1K (with DeiT-S) | DynamicViT (90%) | Top 1 Accuracy | 79.8 | #4 |
| Efficient ViTs | ImageNet-1K (with DeiT-S) | DynamicViT (90%) | GFLOPs | 4.0 | #39 |
| Efficient ViTs | ImageNet-1K (with LV-ViT-S) | DynamicViT (70%) | Top 1 Accuracy | 83.0 | #10 |
| Efficient ViTs | ImageNet-1K (with LV-ViT-S) | DynamicViT (70%) | GFLOPs | 4.6 | #8 |
| Efficient ViTs | ImageNet-1K (with LV-ViT-S) | DynamicViT (80%) | Top 1 Accuracy | 83.2 | #5 |
| Efficient ViTs | ImageNet-1K (with LV-ViT-S) | DynamicViT (80%) | GFLOPs | 5.1 | #3 |
| Efficient ViTs | ImageNet-1K (with LV-ViT-S) | DynamicViT (90%) | Top 1 Accuracy | 83.3 | #3 |
| Efficient ViTs | ImageNet-1K (with LV-ViT-S) | DynamicViT (90%) | GFLOPs | 5.8 | #2 |
