DaViT: Dual Attention Vision Transformers

7 Apr 2022  ·  Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/dingmyu/davit.
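The channel-token attention is the less familiar half of the pair, and its linear complexity follows directly from the token definition: if each group of channels is a token whose feature vector is the set of spatial positions, the attention map is only (C/groups × C/groups) and never grows with image size. The PyTorch module below is a minimal sketch of that idea for illustration only; the module name, the grouping strategy, and the 1/sqrt(N) scaling are our assumptions rather than the reference implementation in the linked repository.

```python
import torch
import torch.nn as nn

class ChannelGroupAttention(nn.Module):
    """Sketch of channel-token attention: channels (in groups) act as tokens,
    spatial positions act as the feature dimension, so the attention map is
    (C/groups x C/groups) and the cost is linear in the number of pixels."""

    def __init__(self, dim: int, groups: int = 8):
        super().__init__()
        assert dim % groups == 0, "dim must be divisible by groups"
        self.groups = groups
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, C), N = H * W
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.groups, C // self.groups)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B, groups, N, C/groups)
        # Attention between channel tokens: shape (C/groups, C/groups),
        # computed by summing over all N spatial positions (global context).
        attn = (q.transpose(-2, -1) @ k) * (N ** -0.5)
        attn = attn.softmax(dim=-1)
        out = (v @ attn).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Usage sketch: 56x56 feature map with 96 channels.
x = torch.randn(2, 56 * 56, 96)
y = ChannelGroupAttention(96)(x)               # same shape as x
```

In the architecture described by the abstract, a block of this kind would be alternated with a windowed spatial-attention block of the same width, so that local spatial refinement and global channel-wise mixing feed into each other.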


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Semantic Segmentation | ADE20K | DaViT-T | Validation mIoU | 46.3 | #163 |
| Semantic Segmentation | ADE20K | DaViT-B | Validation mIoU | 49.4 | #119 |
| Semantic Segmentation | ADE20K val | DaViT-B (UperNet) | mIoU | 46.3 | #65 |
| Semantic Segmentation | ADE20K val | DaViT-S (UperNet) | mIoU | 48.8 | #56 |
| Object Detection | COCO minival | DaViT-T (Mask R-CNN, 36 epochs) | box AP | 47.4 | #90 |
| Object Detection | COCO minival | DaViT-S (Mask R-CNN, 36 epochs) | box AP | 49.5 | #78 |
| Object Detection | COCO minival | DaViT-B (Mask R-CNN, 36 epochs) | box AP | 49.9 | #76 |
| Instance Segmentation | COCO minival | DaViT-T (Mask R-CNN, 36 epochs) | mask AP | 42.9 | #60 |
| Instance Segmentation | COCO minival | DaViT-S (Mask R-CNN, 36 epochs) | mask AP | 44.3 | #52 |
| Instance Segmentation | COCO minival | DaViT-B (Mask R-CNN, 36 epochs) | mask AP | 44.6 | #46 |
| Image Classification | ImageNet | DaViT-T | Top 1 Accuracy | 82.8% | #436 |
| Image Classification | ImageNet | DaViT-T | Number of params | 28.3M | #612 |
| Image Classification | ImageNet | DaViT-T | GFLOPs | 4.5 | #207 |
| Image Classification | ImageNet | DaViT-S | Top 1 Accuracy | 84.2% | #299 |
| Image Classification | ImageNet | DaViT-S | Number of params | 49.7M | #695 |
| Image Classification | ImageNet | DaViT-S | GFLOPs | 8.8 | #278 |
| Image Classification | ImageNet | DaViT-B | Top 1 Accuracy | 84.6% | #276 |
| Image Classification | ImageNet | DaViT-B | Number of params | 87.9M | #801 |
| Image Classification | ImageNet | DaViT-B | GFLOPs | 15.5 | #336 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | Top 1 Accuracy | 86.9% | #112 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | Number of params | 87.9M | #801 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | GFLOPs | 46.4 | #411 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | Top 1 Accuracy | 87.5% | #83 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | Number of params | 196.8M | #867 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | GFLOPs | 103 | #445 |
| Image Classification | ImageNet | DaViT-H | Top 1 Accuracy | 90.2% | #13 |
| Image Classification | ImageNet | DaViT-H | Number of params | 362M | #897 |
| Image Classification | ImageNet | DaViT-H | GFLOPs | 334 | #471 |
| Image Classification | ImageNet | DaViT-G | Top 1 Accuracy | 90.4% | #12 |
| Image Classification | ImageNet | DaViT-G | Number of params | 1437M | #930 |
| Image Classification | ImageNet | DaViT-G | GFLOPs | 1038 | #483 |

Methods