Dilated Neighborhood Attention Transformer

29 Sep 2022 · Ali Hassani, Humphrey Shi

Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self-attention's quadratic complexity, local attention weakens two of the most desirable properties of self-attention: long-range inter-dependency modeling and a global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible, and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster than its Swin counterpart and ahead of it by 1.6% box AP in COCO object detection, 1.4% mask AP in COCO instance segmentation, and 1.4% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state-of-the-art panoptic segmentation model on COCO (58.5 PQ) and ADE20K (49.4 PQ), and the new state-of-the-art instance segmentation model on Cityscapes (45.1 AP) and ADE20K (35.4 AP) (no extra data). It also matches state-of-the-art specialized semantic segmentation models on ADE20K (58.1 mIoU) and ranks second on Cityscapes (84.5 mIoU) (no extra data).

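To make the core mechanism concrete, below is a minimal, single-head, 1D sketch of dilated neighborhood attention in PyTorch. It is illustrative only and is not the authors' implementation (DiNAT uses the CUDA-backed NATTEN library and operates on 2D feature maps); the function name, tensor shapes, and boundary handling are simplifying assumptions. The point it demonstrates is the one made in the abstract: a dilation factor spreads a fixed-size neighborhood over a proportionally wider span, so the receptive field grows without any increase in the number of attended keys.

```python
import torch

def dilated_neighborhood_attention_1d(q, k, v, kernel_size=7, dilation=1):
    # q, k, v: (batch, length, dim) tensors for a single attention head.
    # Each query attends to `kernel_size` keys taken from its own dilation
    # group (positions sharing i % dilation), i.e. keys spaced `dilation`
    # apart; near the sequence edges the window is shifted, not shrunk.
    # dilation=1 recovers plain (undilated) neighborhood attention.
    B, L, D = q.shape
    assert L >= kernel_size * dilation, "sequence too short for this kernel_size/dilation"
    half = kernel_size // 2
    out = torch.empty_like(q)
    for i in range(L):
        group = torch.arange(i % dilation, L, dilation, device=q.device)  # i's dilation group
        pos = (i - i % dilation) // dilation           # index of i within its group
        start = min(max(pos - half, 0), len(group) - kernel_size)
        idx = group[start : start + kernel_size]       # kernel_size keys, `dilation` apart
        scores = (q[:, i : i + 1] @ k[:, idx].transpose(1, 2)) / D ** 0.5  # (B, 1, kernel_size)
        out[:, i] = (scores.softmax(dim=-1) @ v[:, idx]).squeeze(1)
    return out

# Example: with kernel_size=7 and dilation=3, each token attends to 7 keys
# spread over a 19-token span -- the same cost as an undilated 7-token window.
x = torch.randn(2, 64, 32)
y = dilated_neighborhood_attention_1d(x, x, x, kernel_size=7, dilation=3)
print(y.shape)  # torch.Size([2, 64, 32])
```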
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | DiNAT-Mini (UperNet) | Validation mIoU | 47.2 | #157 |
| Semantic Segmentation | ADE20K | DiNAT-L (Mask2Former) | Validation mIoU | 58.1 | #19 |
| Semantic Segmentation | ADE20K | DiNAT_s-Large (UperNet) | Validation mIoU | 54.6 | #53 |
| Semantic Segmentation | ADE20K | DiNAT-Large (UperNet) | Validation mIoU | 54.9 | #47 |
| Semantic Segmentation | ADE20K | DiNAT-Base (UperNet) | Validation mIoU | 50.4 | #106 |
| Semantic Segmentation | ADE20K | DiNAT-Small (UperNet) | Validation mIoU | 49.9 | #116 |
| Semantic Segmentation | ADE20K | DiNAT-Tiny (UperNet) | Validation mIoU | 48.8 | #135 |
| Instance Segmentation | ADE20K val | DiNAT-L (Mask2Former, single-scale) | AP | 35.4 | #8 |
| | | | APS | 16.3 | #4 |
| | | | APM | 39.0 | #5 |
| | | | APL | 55.5 | #4 |
| Panoptic Segmentation | ADE20K val | DiNAT-L (Mask2Former, 640x640) | PQ | 49.4 | #13 |
| | | | AP | 35.0 | #10 |
| | | | mIoU | 56.3 | #11 |
| Semantic Segmentation | ADE20K val | DiNAT-L (Mask2Former) | mIoU | 58.1 | #15 |
| Panoptic Segmentation | Cityscapes val | DiNAT-L (Mask2Former) | PQ | 67.2 | #11 |
| | | | mIoU | 83.4 | #8 |
| | | | AP | 44.5 | #8 |
| Instance Segmentation | Cityscapes val | DiNAT-L (Mask2Former, single-scale) | mask AP | 45.1 | #7 |
| | | | AP50 | 72.6 | #3 |
| Semantic Segmentation | Cityscapes val | DiNAT-L (Mask2Former) | mIoU | 84.5 | #13 |
| Instance Segmentation | COCO minival | DiNAT-L (Mask2Former, single-scale) | mask AP | 50.8 | #20 |
| | | | AP50 | 75.0 | #4 |
| Panoptic Segmentation | COCO minival | DiNAT-L (Mask2Former, single-scale) | PQ | 58.5 | #4 |
| | | | PQth | 64.9 | #3 |
| | | | PQst | 48.8 | #2 |
| | | | AP | 49.2 | #4 |
| | | | mIoU | 68.3 | #2 |
| Image Classification | ImageNet | DiNAT_s-Large (224x224; pretrained on ImageNet-22K @ 224x224) | Top-1 Accuracy | 86.5% | #135 |
| | | | GFLOPs | 34.5 | #400 |
| Image Classification | ImageNet | DiNAT-Large (11x11 kernel; 384x384; pretrained on ImageNet-22K @ 224x224) | Top-1 Accuracy | 87.5% | #86 |
| | | | Number of params | 200M | #901 |
| | | | GFLOPs | 92.4 | #444 |
| Image Classification | ImageNet | DiNAT_s-Large (384x384; pretrained on ImageNet-22K @ 224x224) | Top-1 Accuracy | 87.4% | #93 |
| | | | Number of params | 197M | #897 |
| | | | GFLOPs | 101.5 | #448 |
| Image Classification | ImageNet | DiNAT-Mini | Top-1 Accuracy | 81.8% | #553 |
| | | | Number of params | 20M | #536 |
| | | | GFLOPs | 2.7 | #167 |
| Image Classification | ImageNet | DiNAT-Base | Top-1 Accuracy | 84.4% | #299 |
| | | | Number of params | 90M | #847 |
| | | | GFLOPs | 13.7 | #330 |
| Image Classification | ImageNet | DiNAT-Small | Top-1 Accuracy | 83.8% | #358 |
| | | | Number of params | 51M | #729 |
| | | | GFLOPs | 7.8 | #261 |
| Image Classification | ImageNet | DiNAT-Tiny | Top-1 Accuracy | 82.7% | #465 |
| | | | Number of params | 28M | #629 |
| | | | GFLOPs | 4.3 | #202 |
| Image Classification | ImageNet | DiNAT-Large (384x384; pretrained on ImageNet-22K @ 224x224) | Top-1 Accuracy | 87.4% | #93 |
| | | | GFLOPs | 89.7 | #443 |
