Neighborhood Attention Transformer

14 Apr 2022  ·  Ali Hassani, Steven Walton, Jiachen Li, Shen Li, Humphrey Shi ·

We present Neighborhood Attention (NA), the first efficient and scalable sliding-window attention mechanism for vision. NA is a pixel-wise operation, localizing self-attention (SA) to the nearest neighboring pixels, and therefore enjoys linear time and space complexity compared to the quadratic complexity of SA. The sliding-window pattern allows NA's receptive field to grow without needing extra pixel shifts, and preserves translational equivariance, unlike Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less memory. We further present Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA that boosts image classification and downstream vision performance. Experimental results on NAT are competitive; NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO, and 48.4% mIoU on ADE20K, which amounts to improvements of 1.9% in ImageNet accuracy, 1.0% in COCO mAP, and 2.6% in ADE20K mIoU over a Swin model of similar size. To support more research based on sliding-window attention, we open source our project and release our checkpoints at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer .
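The core idea can be illustrated with a naive, single-head sketch: each query pixel attends only to its k × k nearest neighbors, with the window clamped at image borders so every query sees exactly k² keys (this is what distinguishes NA from a zero-padded convolution-style window). The sketch below uses plain NumPy, omits the learned query/key/value projections and relative positional biases of the actual method, and is an illustration of the attention pattern only, not the NATTEN implementation.

```python
import numpy as np

def neighborhood_attention(x, k):
    """Naive single-head Neighborhood Attention over a feature map.

    x: (H, W, d) feature map; k: odd neighborhood size.
    Each pixel attends to its k x k nearest neighbors; near borders
    the window is clamped (shifted inward) rather than shrunk, so
    every query attends to exactly k*k keys.
    Q/K/V projections and positional biases are omitted for brevity.
    """
    H, W, d = x.shape
    r = k // 2
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            # Clamp the k x k window so it stays inside the image.
            i0 = min(max(i - r, 0), H - k)
            j0 = min(max(j - r, 0), W - k)
            neigh = x[i0:i0 + k, j0:j0 + k].reshape(-1, d)  # (k*k, d)
            q = x[i, j]                                     # (d,)
            scores = neigh @ q / np.sqrt(d)                 # (k*k,)
            attn = np.exp(scores - scores.max())            # stable softmax
            attn /= attn.sum()
            out[i, j] = attn @ neigh                        # weighted sum
    return out
```

Because each of the H × W queries touches only k² keys, the cost is O(H · W · k² · d), i.e. linear in the number of pixels, versus O((H · W)² · d) for global self-attention; the efficient kernels in NATTEN realize this pattern without materializing per-pixel neighborhoods.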


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Semantic Segmentation | ADE20K | NAT-Mini | Validation mIoU | 46.4 | #132 |
| Semantic Segmentation | ADE20K | NAT-Mini | Params (M) | 50 | #30 |
| Semantic Segmentation | ADE20K | NAT-Mini | GFLOPs (512 x 512) | 900 | #8 |
| Semantic Segmentation | ADE20K | NAT-Base | Validation mIoU | 49.7 | #86 |
| Semantic Segmentation | ADE20K | NAT-Base | Params (M) | 123 | #11 |
| Semantic Segmentation | ADE20K | NAT-Base | GFLOPs (512 x 512) | 1137 | #14 |
| Semantic Segmentation | ADE20K | NAT-Small | Validation mIoU | 49.5 | #91 |
| Semantic Segmentation | ADE20K | NAT-Small | Params (M) | 82 | #19 |
| Semantic Segmentation | ADE20K | NAT-Small | GFLOPs (512 x 512) | 1010 | #12 |
| Semantic Segmentation | ADE20K | NAT-Tiny | Validation mIoU | 48.4 | #103 |
| Semantic Segmentation | ADE20K | NAT-Tiny | Params (M) | 58 | #26 |
| Semantic Segmentation | ADE20K | NAT-Tiny | GFLOPs (512 x 512) | 934 | #9 |
| Image Classification | ImageNet | NAT-Tiny | Top 1 Accuracy | 83.2% | #344 |
| Image Classification | ImageNet | NAT-Tiny | Number of params | 28M | #523 |
| Image Classification | ImageNet | NAT-Tiny | GFLOPs | 4.3 | #184 |
| Image Classification | ImageNet | NAT-Base | Top 1 Accuracy | 84.3% | #253 |
| Image Classification | ImageNet | NAT-Base | Number of params | 90M | #728 |
| Image Classification | ImageNet | NAT-Base | GFLOPs | 13.7 | #308 |
| Image Classification | ImageNet | NAT-Small | Top 1 Accuracy | 83.7% | #301 |
| Image Classification | ImageNet | NAT-Small | Number of params | 51M | #616 |
| Image Classification | ImageNet | NAT-Small | GFLOPs | 7.8 | #242 |
| Image Classification | ImageNet | NAT-Mini | Top 1 Accuracy | 81.8% | #460 |
| Image Classification | ImageNet | NAT-Mini | Number of params | 20M | #437 |
| Image Classification | ImageNet | NAT-Mini | GFLOPs | 2.7 | #154 |
