Rethinking and Improving Relative Position Encoding for Vision Transformer

Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. In computer vision, however, its efficacy is not well studied and even remains controversial, e.g., can relative position encoding work as well as absolute position encoding? To clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight, and can be easily plugged into transformer blocks. Experiments demonstrate that, solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablation and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
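To make the idea concrete, here is a minimal sketch (not the authors' exact formulation) of one common way to inject a 2D relative position encoding into self-attention, in the spirit of the key-side ("iRPE-K") bias described above: each query-key pair on an H×W token grid is mapped to a bucket indexed by its 2D relative offset, and a learnable per-bucket scalar is added to the attention logits before softmax. The function names `relative_buckets` and `attention_with_rpe` are illustrative, and the linear offset-to-bucket mapping is an assumption; the paper itself studies several bucketing and interaction variants.

```python
import numpy as np

def relative_buckets(H, W):
    """Map every (query, key) pair on an H*W grid to a bucket id determined
    by the 2D relative offset (dy, dx). There are (2H-1)*(2W-1) buckets."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (N, 2) grid coords
    rel = coords[:, None, :] - coords[None, :, :]         # (N, N, 2) offsets
    rel[..., 0] += H - 1                                  # shift offsets to >= 0
    rel[..., 1] += W - 1
    return rel[..., 0] * (2 * W - 1) + rel[..., 1]        # (N, N) bucket ids

def attention_with_rpe(q, k, v, bias_table, buckets):
    """Single-head attention with a relative position bias added to q.k^T."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                         # (N, N) content term
    logits += bias_table[buckets]                         # relative position term
    logits -= logits.max(axis=-1, keepdims=True)          # numerically stable softmax
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

H = W = 4
N, d = H * W, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
# In a real model bias_table would be a learned parameter, one scalar per bucket.
bias_table = rng.standard_normal((2 * H - 1) * (2 * W - 1))
out = attention_with_rpe(q, k, v, bias_table, relative_buckets(H, W))
print(out.shape)  # (16, 8)
```

Because the bias depends only on the relative offset, tokens with the same spatial displacement share one parameter, which is what keeps such encodings lightweight compared to per-position absolute embeddings.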

ICCV 2021

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Object Detection | COCO minival | DETR-ResNet50 with iRPE-K (300 epochs) | box AP | 42.3 | #140 |
| Object Detection | COCO minival | DETR-ResNet50 with iRPE-K (150 epochs) | box AP | 40.8 | #158 |
| Image Classification | ImageNet | DeiT-S with iRPE-QKV | Top 1 Accuracy | 81.4% | #586 |
| | | | GFLOPs | 9.770 | #295 |
| Image Classification | ImageNet | DeiT-S with iRPE-K | Top 1 Accuracy | 80.9% | #618 |
| | | | Number of params | 22M | #557 |
| | | | GFLOPs | 9.318 | #288 |
| Image Classification | ImageNet | DeiT-Ti with iRPE-K | Top 1 Accuracy | 73.7% | #912 |
| | | | Number of params | 6M | #437 |
| | | | GFLOPs | 2.568 | #163 |
| Image Classification | ImageNet | DeiT-S with iRPE-QK | Top 1 Accuracy | 81.1% | #607 |
| | | | GFLOPs | 9.412 | #290 |
| Image Classification | ImageNet | DeiT-B with iRPE-K | Top 1 Accuracy | 82.4% | #491 |
| | | | Number of params | 87M | #822 |
| | | | GFLOPs | 35.368 | #401 |

None of these results use extra training data.

Methods