ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

NeurIPS 2021  ·  Yufei Xu, Qiming Zhang, Jing Zhang, Dacheng Tao

Transformers have shown great potential in various computer vision tasks owing to their strong capability in modeling long-range dependency using the self-attention mechanism. Nevertheless, vision transformers treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) for modeling local visual structures and dealing with scale variance. Instead, they require large-scale training data and longer training schedules to learn the IB implicitly. In this paper, we propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. Technically, ViTAE has several spatial pyramid reduction modules that downsample and embed the input image into tokens with rich multi-scale context by using multiple convolutions with different dilation rates. In this way, it acquires an intrinsic scale-invariance IB and is able to learn robust feature representations for objects at various scales. Moreover, in each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network. Consequently, it has an intrinsic locality IB and is able to learn local features and global dependencies collaboratively. Experiments on ImageNet as well as downstream tasks demonstrate the superiority of ViTAE over the baseline transformer and concurrent works. Source code and pretrained models will be available on GitHub.
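The two ingredients described above — a reduction module that embeds the image with parallel dilated convolutions, and a transformer layer with a convolution branch alongside self-attention — can be sketched in PyTorch. This is a hedged simplification for illustration only, not the authors' implementation: the dilation rates, the depthwise convolution in the parallel branch, and the fusion-by-addition are assumptions; the paper's actual cells differ in detail.

```python
import torch
import torch.nn as nn


class ReductionCell(nn.Module):
    """Sketch of a spatial pyramid reduction module: parallel convolutions
    with different dilation rates capture multi-scale context while
    downsampling the image into tokens (a simplification of the paper's design)."""

    def __init__(self, in_ch, embed_dim, dilations=(1, 2, 3, 4), stride=4):
        super().__init__()
        # Each branch sees a different receptive field via its dilation rate;
        # padding=d keeps all branch outputs the same spatial size.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, embed_dim, kernel_size=3, stride=stride,
                      padding=d, dilation=d)
            for d in dilations
        ])
        self.proj = nn.Conv2d(embed_dim * len(dilations), embed_dim, 1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        x = self.proj(feats)                 # (B, C, H/s, W/s)
        return x.flatten(2).transpose(1, 2)  # (B, N, C) token sequence


class NormalCell(nn.Module):
    """Sketch of a transformer layer with a parallel convolution branch:
    attention (global) and convolution (local) outputs are fused by
    addition, then passed through the feed-forward network."""

    def __init__(self, dim, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Sequential(  # depthwise + pointwise conv (an assumption)
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, hw):
        h, w = hw
        n = self.norm1(x)
        a, _ = self.attn(n, n, n)                       # global dependency branch
        c = self.conv(x.transpose(1, 2).reshape(x.size(0), -1, h, w))
        c = c.flatten(2).transpose(1, 2)                # local feature branch
        x = x + a + c                                   # fuse both branches
        return x + self.mlp(self.norm2(x))
```

For a 32×32 input with stride 4, `ReductionCell` produces an 8×8 grid of tokens, which `NormalCell` then processes with its two parallel branches.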

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2016 | ViTAE-T-Stage | Jaccard (Mean) | 89.2 | #11 |
| Video Object Segmentation | DAVIS 2016 | ViTAE-T-Stage | F-Score | 90.4 | #15 |
| Video Object Segmentation | DAVIS 2016 | ViTAE-T-Stage | J&F | 89.8 | #13 |
| Video Object Segmentation | DAVIS 2017 | ViTAE-T-Stage | Jaccard (Mean) | 79.4 | #2 |
| Video Object Segmentation | DAVIS 2017 | ViTAE-T-Stage | J&F | 82.5 | #1 |
| Video Object Segmentation | DAVIS 2017 | ViTAE-T-Stage | F-Score | 85.5 | #2 |
| Image Classification | ImageNet | ViTAE-6M | Top 1 Accuracy | 77.9% | #792 |
| Image Classification | ImageNet | ViTAE-6M | Number of params | 6.5M | #444 |
| Image Classification | ImageNet | ViTAE-6M | GFLOPs | 4 | #191 |
| Image Classification | ImageNet | ViTAE-T | Top 1 Accuracy | 75.3% | #880 |
| Image Classification | ImageNet | ViTAE-T | GFLOPs | 3.0 | #174 |
| Image Classification | ImageNet | ViTAE-T-Stage | Top 1 Accuracy | 76.8% | #828 |
| Image Classification | ImageNet | ViTAE-T-Stage | Number of params | 4.8M | #394 |
| Image Classification | ImageNet | ViTAE-T-Stage | GFLOPs | 4.6 | #215 |
| Image Classification | ImageNet | ViTAE-S-Stage | Top 1 Accuracy | 82.2% | #510 |
| Image Classification | ImageNet | ViTAE-S-Stage | Number of params | 19.2M | #533 |
| Image Classification | ImageNet | ViTAE-S-Stage | GFLOPs | 12.0 | #314 |
| Image Classification | ImageNet | ViTAE-B-Stage | Top 1 Accuracy | 83.6% | #378 |
| Image Classification | ImageNet | ViTAE-B-Stage | Number of params | 48.5M | #716 |
| Image Classification | ImageNet | ViTAE-B-Stage | GFLOPs | 27.6 | #387 |
| Image Classification | ImageNet | ViTAE-13M | Top 1 Accuracy | 81% | #614 |
| Image Classification | ImageNet | ViTAE-13M | Number of params | 13.2M | #507 |
| Image Classification | ImageNet | ViTAE-13M | GFLOPs | 6.8 | #246 |
