DaViT: Dual Attention Vision Transformers

7 Apr 2022  ·  Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan ·

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image-text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/dingmyu/davit.
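The channel-token idea above can be made concrete with a short sketch. The module below is an illustrative reimplementation, not the authors' released code: channels (split into groups, analogous to heads) act as tokens, so the attention map has size (C/G × C/G) and is independent of the number of spatial positions N, which is what keeps the cost linear in image size while each score still aggregates over every spatial location. The class name, group count, and scaling choice are assumptions for illustration.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Illustrative channel self-attention in the spirit of DaViT.

    Channels act as tokens and spatial positions act as the feature
    dimension, so each channel-channel attention score sums over all
    spatial positions (global context), and the attention map size
    (C/G x C/G) does not grow with the number of spatial tokens N.
    """

    def __init__(self, dim: int, num_groups: int = 8):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) -- N spatial tokens with C channels each
        B, N, C = x.shape
        G = self.num_groups
        qkv = self.qkv(x).reshape(B, N, 3, G, C // G)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, G, N, C/G)
        # Transpose so channels index the tokens: (B, G, C/G, N)
        q, k, v = (t.transpose(-2, -1) for t in (q, k, v))
        # Scaling by N^-0.5 is an illustrative choice (the inner
        # dimension of the dot product is N after the transpose).
        attn = (q * N ** -0.5) @ k.transpose(-2, -1)  # (B, G, C/G, C/G)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(-2, -1)            # (B, G, N, C/G)
        out = out.permute(0, 2, 1, 3).reshape(B, N, C)
        return self.proj(out)
```

In the full architecture this block would alternate with a windowed spatial attention (local, fine-grained), so the two attentions complement each other as described in the abstract.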


Results from the Paper


Ranked #8 on Image Classification on ImageNet (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Semantic Segmentation | ADE20K | DaViT-T | Validation mIoU | 46.3 | #97 |
| Semantic Segmentation | ADE20K | DaViT-B | Validation mIoU | 49.4 | #61 |
| Semantic Segmentation | ADE20K val | DaViT-B (UperNet) | mIoU | 46.3 | #52 |
| Semantic Segmentation | ADE20K val | DaViT-S (UperNet) | mIoU | 48.8 | #44 |
| Object Detection | COCO minival | DaViT-B (Mask R-CNN, 36 epochs) | box AP | 49.9 | #48 |
| Object Detection | COCO minival | DaViT-T (Mask R-CNN, 36 epochs) | box AP | 47.4 | #61 |
| Object Detection | COCO minival | DaViT-S (Mask R-CNN, 36 epochs) | box AP | 49.5 | #50 |
| Instance Segmentation | COCO minival | DaViT-T (Mask R-CNN, 36 epochs) | mask AP | 42.9 | #40 |
| Instance Segmentation | COCO minival | DaViT-S (Mask R-CNN, 36 epochs) | mask AP | 44.3 | #34 |
| Instance Segmentation | COCO minival | DaViT-B (Mask R-CNN, 36 epochs) | mask AP | 44.6 | #30 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | Top 1 Accuracy | 86.9% | #60 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | Number of params | 87.9M | #572 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | GFLOPs | 46.4 | #307 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | Top 1 Accuracy | 87.5% | #42 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | Number of params | 196.8M | #619 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | GFLOPs | 103 | #330 |
| Image Classification | ImageNet | DaViT-S | Top 1 Accuracy | 84.2% | #175 |
| Image Classification | ImageNet | DaViT-S | Number of params | 49.7M | #488 |
| Image Classification | ImageNet | DaViT-S | GFLOPs | 8.8 | #221 |
| Image Classification | ImageNet | DaViT-T | Top 1 Accuracy | 82.8% | #271 |
| Image Classification | ImageNet | DaViT-T | Number of params | 28.3M | #429 |
| Image Classification | ImageNet | DaViT-T | GFLOPs | 4.5 | #169 |
| Image Classification | ImageNet | DaViT-B | Top 1 Accuracy | 84.6% | #162 |
| Image Classification | ImageNet | DaViT-B | GFLOPs | 15.5 | #261 |
| Image Classification | ImageNet | DaViT-G | Top 1 Accuracy | 90.4% | #8 |
| Image Classification | ImageNet | DaViT-G | Number of params | 1437M | #670 |
| Image Classification | ImageNet | DaViT-G | GFLOPs | 1038 | #367 |
| Image Classification | ImageNet | DaViT-H | Top 1 Accuracy | 90.2% | #9 |
| Image Classification | ImageNet | DaViT-H | Number of params | 362M | #640 |
| Image Classification | ImageNet | DaViT-H | GFLOPs | 334 | #355 |

Methods