All Tokens Matter: Token Labeling for Training Better Vision Transformers

In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs, which computes the classification loss on an additional trainable class token, our proposed objective takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token an individual, location-specific supervision signal generated by a machine annotator. Experiments show that token labeling clearly and consistently improves the performance of ViT models across a wide spectrum of sizes. Taking a vision transformer with 26M learnable parameters as an example, with token labeling the model achieves 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by scaling the model up slightly to 150M parameters, making it the smallest model to reach 86%; previous models reaching this accuracy required 250M+ parameters. We also show that token labeling clearly improves the generalization of the pre-trained models on downstream dense-prediction tasks such as semantic segmentation. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.
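To make the objective concrete, below is a minimal PyTorch sketch of how a dense token-level loss can be combined with the usual class-token loss. The function names, tensor shapes, and the weighting factor `beta` are illustrative assumptions rather than the repository's actual API; the dense soft labels are assumed to be produced offline by a pre-trained machine annotator.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, soft_targets):
    # Cross-entropy against soft (probability-distribution) targets.
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def token_labeling_loss(cls_logits, patch_logits, image_label, token_labels, beta=0.5):
    # cls_logits:   (B, C)    prediction from the extra class token
    # patch_logits: (B, N, C) per-patch-token predictions
    # image_label:  (B,)      ground-truth image-level class indices
    # token_labels: (B, N, C) location-specific soft labels from a machine annotator
    # beta:                   weight of the auxiliary token-level term (illustrative value)
    cls_loss = F.cross_entropy(cls_logits, image_label)  # standard ViT classification loss
    aux_loss = soft_cross_entropy(                        # dense loss over all patch tokens
        patch_logits.reshape(-1, patch_logits.size(-1)),
        token_labels.reshape(-1, token_labels.size(-1)),
    )
    return cls_loss + beta * aux_loss
```

In this sketch the total loss is the image-level class-token loss plus a token-level term averaged over all patch tokens, which is the dense supervision the abstract describes.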


Datasets

ImageNet, ADE20K
Results

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Semantic Segmentation | ADE20K | LV-ViT-L (UperNet, MS) | Validation mIoU | 51.8 | #84 |
| Semantic Segmentation | ADE20K | LV-ViT-L (UperNet, MS) | Params (M) | 209 | #19 |
| Image Classification | ImageNet | LV-ViT-L | Top 1 Accuracy | 86.4% | #143 |
| Image Classification | ImageNet | LV-ViT-L | Number of params | 151M | #881 |
| Image Classification | ImageNet | LV-ViT-L | GFLOPs | 214.8 | #470 |
| Image Classification | ImageNet | LV-ViT-M | Top 1 Accuracy | 84.1% | #325 |
| Image Classification | ImageNet | LV-ViT-M | Number of params | 56M | #748 |
| Image Classification | ImageNet | LV-ViT-M | GFLOPs | 16 | #346 |
| Image Classification | ImageNet | LV-ViT-S | Top 1 Accuracy | 83.3% | #403 |
| Image Classification | ImageNet | LV-ViT-S | Number of params | 26M | #607 |
| Image Classification | ImageNet | LV-ViT-S | GFLOPs | 6.6 | #244 |
| Efficient ViTs | ImageNet-1K (With LV-ViT-S) | Base (LV-ViT-S) | Top 1 Accuracy | 83.3 | #3 |
| Efficient ViTs | ImageNet-1K (With LV-ViT-S) | Base (LV-ViT-S) | GFLOPs | 6.6 | #1 |
