When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

26 Jan 2022  ·  Guangting Wang, Yucheng Zhao, Chuanxin Tang, Chong Luo, Wenjun Zeng

The attention mechanism is widely believed to be the key to the success of vision transformers (ViTs), since it provides a flexible and powerful way to model spatial relationships. However, is the attention mechanism truly an indispensable part of ViT? Can it be replaced by some other alternative? To demystify the role of the attention mechanism, we simplify it into an extremely simple case: zero FLOPs and zero parameters. Concretely, we revisit the shift operation. It contains no parameters and requires no arithmetic calculation; the only operation is to exchange a small portion of the channels between neighboring features. Based on this simple operation, we construct a new backbone network, namely ShiftViT, in which the attention layers of ViT are substituted by shift operations. Surprisingly, ShiftViT works quite well on several mainstream tasks, e.g., classification, detection, and segmentation, with performance on par with or even better than the strong baseline Swin Transformer. These results suggest that the attention mechanism might not be the vital factor that makes ViT successful; it can even be replaced by a zero-parameter operation. We should pay more attention to the remaining parts of ViT in future work. Code is available at github.com/microsoft/SPACH.
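A minimal sketch of the channel-shift idea described above, assuming PyTorch tensors in (B, C, H, W) layout. The function name `spatial_shift`, the per-direction fraction `gamma`, and zero-filling of the vacated border positions are illustrative assumptions for this sketch, not necessarily the paper's exact configuration.

```python
import torch

def spatial_shift(x: torch.Tensor, gamma: float = 1 / 12) -> torch.Tensor:
    """Zero-parameter, zero-FLOP shift: move a small fraction of channels
    one pixel in each of the four spatial directions; the rest stay put.

    x:     feature map of shape (B, C, H, W)
    gamma: fraction of channels shifted per direction (assumed default)
    """
    B, C, H, W = x.shape
    g = int(C * gamma)           # channels per shifted group
    out = torch.zeros_like(x)    # vacated border positions are zero-filled

    out[:, 0 * g:1 * g, :, 1:]  = x[:, 0 * g:1 * g, :, :-1]   # shift right
    out[:, 1 * g:2 * g, :, :-1] = x[:, 1 * g:2 * g, :, 1:]    # shift left
    out[:, 2 * g:3 * g, 1:, :]  = x[:, 2 * g:3 * g, :-1, :]   # shift down
    out[:, 3 * g:4 * g, :-1, :] = x[:, 3 * g:4 * g, 1:, :]    # shift up
    out[:, 4 * g:, :, :]        = x[:, 4 * g:, :, :]          # unshifted channels
    return out
```

Because the shift itself has no learnable weights and performs no arithmetic, a ShiftViT-style block would simply apply it where a ViT block would apply attention, before the usual normalization and MLP sub-layer.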


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | Shift-T | Validation mIoU | 46.3 | #169 |
| Semantic Segmentation | ADE20K | Shift-B (UperNet) | Validation mIoU | 49.2 | #128 |
| Semantic Segmentation | ADE20K | Shift-B | Validation mIoU | 47.9 | #148 |
| Semantic Segmentation | ADE20K | Shift-S | Validation mIoU | 47.8 | #149 |
| Object Detection | COCO minival | Shift-T | APM | 42.3 | #67 |
| Image Classification | ImageNet | Shift-T | Top 1 Accuracy | 81.7% | #563 |
| Image Classification | ImageNet | Shift-T | Number of params | 28M | #629 |
| Image Classification | ImageNet | Shift-T | GFLOPs | 4.4 | #208 |
| Image Classification | ImageNet | Shift-S | Top 1 Accuracy | 82.8% | #453 |
| Image Classification | ImageNet | Shift-S | Number of params | 50M | #725 |
| Image Classification | ImageNet | Shift-S | GFLOPs | 8.5 | #278 |
| Image Classification | ImageNet | Shift-B | Top 1 Accuracy | 83.3% | #403 |
| Image Classification | ImageNet | Shift-B | Number of params | 88M | #832 |
| Image Classification | ImageNet | Shift-B | GFLOPs | 15.2 | #339 |
