Dilated Neighborhood Attention Transformer

29 Sep 2022 · Ali Hassani, Humphrey Shi

Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self-attention's quadratic complexity, local attention weakens two of the most desirable properties of self-attention: long-range inter-dependency modeling and the global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible, and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and we therefore introduce the Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state-of-the-art panoptic segmentation model on COCO (58.2 PQ) and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state-of-the-art specialized semantic segmentation models on ADE20K (58.2 mIoU) and ranks second on Cityscapes (84.5 mIoU) (no extra data). We open-source our project.
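
The abstract describes DiNA as a dilated, sliding-window generalization of Neighborhood Attention: each query attends to a fixed-size neighborhood whose keys and values are sampled with a stride (the dilation), so the window covers a wider area at the same cost as local attention. The sketch below is a rough, unoptimized, single-head illustration of that idea, not the authors' implementation (the released code relies on dedicated fused attention kernels). The function names are ours, relative positional biases and batching are omitted, and the feature map is assumed large enough to contain the dilated window.

```python
# Minimal single-head sketch of 2D Dilated Neighborhood Attention (DiNA).
# Illustrative only: naive loops, no positional bias, no batching/heads.
import torch
import torch.nn.functional as F


def neighborhood_indices(center, size, kernel_size, dilation):
    """Dilated window of `kernel_size` indices around `center`, shifted to stay in-bounds."""
    span = (kernel_size - 1) * dilation
    start = min(max(center - (kernel_size // 2) * dilation, 0), size - 1 - span)
    return torch.arange(start, start + span + 1, dilation)


def dilated_neighborhood_attention(q, k, v, kernel_size=7, dilation=2):
    """q, k, v: (H, W, C) tensors for a single head; returns (H, W, C).

    dilation=1 recovers plain Neighborhood Attention (NA); dilation>1 gives DiNA.
    Assumes H and W are at least (kernel_size - 1) * dilation + 1.
    """
    H, W, C = q.shape
    scale = C ** -0.5
    out = torch.empty_like(q)
    for i in range(H):
        for j in range(W):
            # Dilated neighborhood around query (i, j), one axis at a time.
            rows = neighborhood_indices(i, H, kernel_size, dilation)
            cols = neighborhood_indices(j, W, kernel_size, dilation)
            keys = k[rows][:, cols].reshape(-1, C)   # (kernel_size**2, C)
            vals = v[rows][:, cols].reshape(-1, C)
            attn = F.softmax((q[i, j] @ keys.t()) * scale, dim=-1)
            out[i, j] = attn @ vals
    return out


# Toy usage: 16x16 feature map, 32 channels, 7x7 neighborhood, dilation 2.
x = torch.randn(16, 16, 32)
y = dilated_neighborhood_attention(x, x, x, kernel_size=7, dilation=2)
assert y.shape == x.shape
```

Setting the dilation to 1 recovers plain NA; per the abstract, DiNAT combines local (NA) and dilated (DiNA) layers so that receptive fields expand exponentially with depth while the per-layer cost stays that of local attention.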

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | DiNAT-Base (UperNet) | Validation mIoU | 50.4 | #63 |
| Semantic Segmentation | ADE20K | DiNAT_s-Large (UperNet) | Validation mIoU | 54.6 | #34 |
| Semantic Segmentation | ADE20K | DiNAT-Mini (UperNet) | Validation mIoU | 47.2 | #105 |
| Semantic Segmentation | ADE20K | DiNAT-Tiny (UperNet) | Validation mIoU | 48.8 | #85 |
| Semantic Segmentation | ADE20K | DiNAT-L (Mask2Former) | Validation mIoU | 58.2 | #11 |
| Semantic Segmentation | ADE20K | DiNAT-Small (UperNet) | Validation mIoU | 49.9 | #69 |
| Semantic Segmentation | ADE20K | DiNAT-Large (UperNet) | Validation mIoU | 54.6 | #34 |
| Instance Segmentation | ADE20K val | DiNAT-L (Mask2Former, single-scale) | AP | 35.2 | #3 |
| | | | APS | 16.4 | #1 |
| | | | APM | 40.4 | #1 |
| | | | APL | 54.9 | #1 |
| Semantic Segmentation | ADE20K val | DiNAT-L (Mask2Former) | mIoU | 58.2 | #8 |
| Panoptic Segmentation | ADE20K val | DiNAT-L (Mask2Former) | PQ | 48.5 | #5 |
| | | | AP | 34.4 | #5 |
| | | | mIoU | 56.2 | #5 |
| Panoptic Segmentation | Cityscapes val | DiNAT-L (Mask2Former) | PQ | 66.9 | #9 |
| | | | mIoU | 83.2 | #6 |
| | | | AP | 43.8 | #8 |
| Semantic Segmentation | Cityscapes val | DiNAT-L (Mask2Former) | mIoU | 84.5 | #6 |
| Instance Segmentation | Cityscapes val | DiNAT-L (single-scale, Mask2Former) | mask AP | 44.5 | #3 |
| | | | AP50 | 72.2 | #1 |
| Instance Segmentation | COCO minival | DiNAT-L (single-scale, Mask2Former) | mask AP | 50.7 | #13 |
| | | | AP50 | 74.8 | #2 |
| Panoptic Segmentation | COCO minival | DiNAT-L (single-scale, Mask2Former) | PQ | 58.2 | #2 |
| | | | PQth | 64.7 | #2 |
| | | | PQst | 48.4 | #4 |
| | | | AP | 49.2 | #1 |
| | | | mIoU | 68.1 | #1 |
| Image Classification | ImageNet | DiNAT-Base | Top 1 Accuracy | 84.4% | #207 |
| | | | Number of params | 90M | #682 |
| | | | GFLOPs | 13.7 | #291 |
| Image Classification | ImageNet | DiNAT-Large (384x384; Pretrained on ImageNet-22K @ 224x224) | Top 1 Accuracy | 87.18% | #70 |
| | | | GFLOPs | 89.7 | #379 |
| Image Classification | ImageNet | DiNAT-Small | Top 1 Accuracy | 83.8% | #251 |
| | | | Number of params | 51M | #578 |
| | | | GFLOPs | 7.8 | #235 |
| Image Classification | ImageNet | DiNAT-Tiny | Top 1 Accuracy | 82.7% | #341 |
| | | | Number of params | 28M | #494 |
| | | | GFLOPs | 4.3 | #180 |
| Image Classification | ImageNet | DiNAT-Mini | Top 1 Accuracy | 81.8% | #412 |
| | | | Number of params | 20M | #417 |
| | | | GFLOPs | 2.7 | #154 |
| Image Classification | ImageNet | DiNAT_s-Large (224x224; Pretrained on ImageNet-22K @ 224x224) | Top 1 Accuracy | 86.5% | #96 |
| | | | GFLOPs | 34.5 | #347 |
| Image Classification | ImageNet | DiNAT_s-Large (384x384; Pretrained on ImageNet-22K @ 224x224) | Top 1 Accuracy | 87.4% | #63 |
| | | | Number of params | 197M | #722 |
| | | | GFLOPs | 101.5 | #384 |
| Image Classification | ImageNet | DiNAT-Large (11x11 kernel; 384x384; Pretrained on ImageNet-22K @ 224x224) | Top 1 Accuracy | 87.31% | #65 |
| | | | Number of params | 200M | #725 |
| | | | GFLOPs | 92.4 | #380 |

Methods