Rethinking and Improving Relative Position Encoding for Vision Transformer

Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. In computer vision, however, its efficacy is not well studied and even remains controversial, e.g., can relative position encoding work as well as absolute position encoding? To clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight, and can be easily plugged into transformer blocks. Experiments demonstrate that, solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablation and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
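To make the idea concrete, here is a minimal sketch (not the authors' exact formulation) of one common way to inject a 2D relative position encoding into self-attention, in the spirit of the key-side ("iRPE-K") bias described above: each query-key pair on an H×W token grid is mapped to a bucket indexed by its 2D relative offset, and a learnable per-bucket scalar is added to the attention logits before softmax. The function names `relative_buckets` and `attention_with_rpe` are illustrative, and the linear offset-to-bucket mapping is an assumption; the paper itself studies several bucketing and interaction variants.

```python
import numpy as np

def relative_buckets(H, W):
    """Map every (query, key) pair on an H*W grid to a bucket id determined
    by the 2D relative offset (dy, dx). There are (2H-1)*(2W-1) buckets."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (N, 2) grid coords
    rel = coords[:, None, :] - coords[None, :, :]         # (N, N, 2) offsets
    rel[..., 0] += H - 1                                  # shift offsets to >= 0
    rel[..., 1] += W - 1
    return rel[..., 0] * (2 * W - 1) + rel[..., 1]        # (N, N) bucket ids

def attention_with_rpe(q, k, v, bias_table, buckets):
    """Single-head attention with a relative position bias added to q.k^T."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                         # (N, N) content term
    logits += bias_table[buckets]                         # relative position term
    logits -= logits.max(axis=-1, keepdims=True)          # numerically stable softmax
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

H = W = 4
N, d = H * W, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
# In a real model bias_table would be a learned parameter, one scalar per bucket.
bias_table = rng.standard_normal((2 * H - 1) * (2 * W - 1))
out = attention_with_rpe(q, k, v, bias_table, relative_buckets(H, W))
print(out.shape)  # (16, 8)
```

Because the bias depends only on the relative offset, tokens with the same spatial displacement share one parameter, which is what keeps such encodings lightweight compared to per-position absolute embeddings.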

ICCV 2021

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Object Detection | COCO minival | DETR-ResNet50 with iRPE-K (300 epochs) | box AP | 42.3 | #140 |
| Object Detection | COCO minival | DETR-ResNet50 with iRPE-K (150 epochs) | box AP | 40.8 | #158 |
| Image Classification | ImageNet | DeiT-S with iRPE-QKV | Top 1 Accuracy | 81.4% | #586 |
| | | | GFLOPs | 9.770 | #295 |
| Image Classification | ImageNet | DeiT-S with iRPE-K | Top 1 Accuracy | 80.9% | #618 |
| | | | Number of params | 22M | #557 |
| | | | GFLOPs | 9.318 | #288 |
| Image Classification | ImageNet | DeiT-Ti with iRPE-K | Top 1 Accuracy | 73.7% | #912 |
| | | | Number of params | 6M | #437 |
| | | | GFLOPs | 2.568 | #163 |
| Image Classification | ImageNet | DeiT-S with iRPE-QK | Top 1 Accuracy | 81.1% | #607 |
| | | | GFLOPs | 9.412 | #290 |
| Image Classification | ImageNet | DeiT-B with iRPE-K | Top 1 Accuracy | 82.4% | #491 |
| | | | Number of params | 87M | #822 |
| | | | GFLOPs | 35.368 | #401 |

None of these results use extra training data.

Methods