ELSA: Enhanced Local Self-Attention for Vision Transformer

23 Dec 2021  ·  Jingkai Zhou, Pichao Wang, Fan Wang, Qiong Liu, Hao Li, Rong Jin

Self-attention is powerful in modeling long-range dependencies, but it is weak at local, finer-level feature learning. The performance of local self-attention (LSA) is merely on par with convolution and inferior to dynamic filters, which leaves researchers puzzled over whether to use LSA or its counterparts, which one is better, and what makes LSA mediocre. To clarify these questions, we comprehensively investigate LSA and its counterparts from two sides: channel setting and spatial processing. We find that the devil lies in the generation and application of spatial attention, where relative position embeddings and the neighboring filter application are the key factors. Based on these findings, we propose enhanced local self-attention (ELSA) with Hadamard attention and the ghost head. Hadamard attention introduces the Hadamard product to efficiently generate attention in the neighboring case while maintaining high-order mapping. The ghost head combines attention maps with static matrices to increase channel capacity. Experiments demonstrate the effectiveness of ELSA. Without any architecture or hyperparameter modification, replacing LSA with ELSA as a drop-in boosts Swin Transformer by up to +1.4% top-1 accuracy. ELSA also consistently benefits VOLO from D1 to D5, where ELSA-VOLO-D5 achieves 87.2% top-1 accuracy on ImageNet-1K without extra training images. In addition, we evaluate ELSA on downstream tasks. ELSA significantly improves the baselines by up to +1.9 box AP / +1.3 mask AP on COCO, and by up to +1.9 mIoU on ADE20K. Code is available at https://github.com/damo-cv/ELSA.
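To make the two ideas in the abstract concrete, below is a minimal, illustrative PyTorch sketch of a local attention block that (a) generates neighborhood attention from the Hadamard product of query and key and (b) mixes each dynamic attention map with learned static "ghost" matrices. This is not the official ELSA implementation: the layer choices (1×1 convolutions, additive combination before the softmax) and names such as `ToyHadamardAttention`, `num_ghosts`, and `kernel_size` are assumptions made only for this sketch; see the linked repository for the real model.

```python
# Toy sketch of Hadamard attention + a ghost head over a k x k neighborhood.
# Illustrative only; design details are assumptions, not the official ELSA code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyHadamardAttention(nn.Module):
    def __init__(self, dim, num_heads=2, kernel_size=3, num_ghosts=2):
        super().__init__()
        assert dim % (num_heads * num_ghosts) == 0
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.num_ghosts = num_ghosts  # static "ghost" copies per dynamic head
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        # Mapping from the Hadamard product to k*k attention logits per head.
        self.attn = nn.Conv2d(dim, num_heads * kernel_size ** 2, 1)
        # Ghost head: learned, input-independent matrices combined with the
        # dynamic attention maps to increase channel capacity cheaply.
        self.ghost = nn.Parameter(
            0.02 * torch.randn(num_ghosts, num_heads, kernel_size ** 2))
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        k2 = self.kernel_size ** 2
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Hadamard (element-wise) product of query and key replaces the
        # dot product used by vanilla local self-attention.
        attn = self.attn(q * k).view(B, self.num_heads, k2, H, W)

        # Combine each dynamic map with static ghost matrices (here: addition),
        # yielding num_ghosts * num_heads effective heads.
        attn = attn.unsqueeze(1) + self.ghost.view(
            1, self.num_ghosts, self.num_heads, k2, 1, 1)
        heads = self.num_ghosts * self.num_heads
        attn = attn.reshape(B, heads, k2, H, W).softmax(dim=2)

        # Gather the k x k neighborhood of every pixel and apply the attention
        # as a neighboring filter (weighted sum over the window).
        v = F.unfold(v, self.kernel_size, padding=self.kernel_size // 2)
        v = v.reshape(B, heads, C // heads, k2, H, W)
        out = (attn.unsqueeze(2) * v).sum(dim=3)  # B, heads, C//heads, H, W
        return self.proj(out.reshape(B, C, H, W))


if __name__ == "__main__":
    x = torch.randn(2, 64, 14, 14)
    print(ToyHadamardAttention(64)(x).shape)  # torch.Size([2, 64, 14, 14])
```

The sketch keeps the spatial resolution and channel count of the input, so such a block could in principle replace a window-attention layer without touching the surrounding architecture, which is how the abstract describes the drop-in use of ELSA.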


Results from the Paper


Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Semantic Segmentation | ADE20K | ELSA-Swin-S | Validation mIoU | 50.3 | #107
Semantic Segmentation | ADE20K val | ELSA-Swin-S | mIoU | 50.3 | #46
Instance Segmentation | COCO minival | ELSA-S (Mask R-CNN) | mask AP | 43.0 | #59
Instance Segmentation | COCO minival | ELSA-S (Mask R-CNN) | AP50 | 67.3 | #8
Instance Segmentation | COCO minival | ELSA-S (Mask R-CNN) | AP75 | 46.4 | #9
Instance Segmentation | COCO minival | ELSA-S (Cascade Mask R-CNN) | mask AP | 44.4 | #52
Instance Segmentation | COCO minival | ELSA-S (Cascade Mask R-CNN) | AP50 | 67.8 | #7
Instance Segmentation | COCO minival | ELSA-S (Cascade Mask R-CNN) | AP75 | 47.8 | #8
Object Detection | COCO minival | ELSA-S (Mask R-CNN) | box AP | 48.3 | #85
Object Detection | COCO minival | ELSA-S (Mask R-CNN) | AP50 | 70.4 | #18
Object Detection | COCO minival | ELSA-S (Mask R-CNN) | AP75 | 52.9 | #22
Object Detection | COCO minival | ELSA-S (Cascade Mask R-CNN) | box AP | 51.6 | #69
Object Detection | COCO minival | ELSA-S (Cascade Mask R-CNN) | AP50 | 70.5 | #17
Object Detection | COCO minival | ELSA-S (Cascade Mask R-CNN) | AP75 | 56.0 | #12
Image Classification | ImageNet | ELSA-VOLO-D1 | Top-1 Accuracy | 84.7% | #277
Image Classification | ImageNet | ELSA-VOLO-D1 | Number of params | 27M | #610
Image Classification | ImageNet | ELSA-VOLO-D1 | GFLOPs | 8 | #267
Image Classification | ImageNet | ELSA-VOLO-D5 (512×512) | Top-1 Accuracy | 87.2% | #101
Image Classification | ImageNet | ELSA-VOLO-D5 (512×512) | Number of params | 298M | #907
Image Classification | ImageNet | ELSA-VOLO-D5 (512×512) | GFLOPs | 437 | #481
Image Classification | ImageNet | ELSA-Swin-T | Top-1 Accuracy | 82.7% | #460
Image Classification | ImageNet | ELSA-Swin-T | Number of params | 28M | #624
Image Classification | ImageNet | ELSA-Swin-T | GFLOPs | 4.8 | #226

Methods