Dilated Neighborhood Attention Transformer

29 Sep 2022 · Ali Hassani, Humphrey Shi

Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self-attention's quadratic complexity, local attention weakens two of the most desirable properties of self-attention: long-range inter-dependency modeling and the global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible, and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and we therefore introduce the Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state-of-the-art panoptic segmentation model on COCO (58.2 PQ) and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state-of-the-art specialized semantic segmentation models on ADE20K (58.2 mIoU) and ranks second on Cityscapes (84.5 mIoU) (no extra data). We open-source our project.
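
The abstract describes DiNA as a dilated, sliding-window generalization of Neighborhood Attention: each query attends to a fixed-size neighborhood whose keys and values are sampled with a stride (the dilation), so the window covers a wider area at the same cost as local attention. The sketch below is a rough, unoptimized, single-head illustration of that idea, not the authors' implementation (the released code relies on dedicated fused attention kernels). The function names are ours, relative positional biases and batching are omitted, and the feature map is assumed large enough to contain the dilated window.

```python
# Minimal single-head sketch of 2D Dilated Neighborhood Attention (DiNA).
# Illustrative only: naive loops, no positional bias, no batching/heads.
import torch
import torch.nn.functional as F


def neighborhood_indices(center, size, kernel_size, dilation):
    """Dilated window of `kernel_size` indices around `center`, shifted to stay in-bounds."""
    span = (kernel_size - 1) * dilation
    start = min(max(center - (kernel_size // 2) * dilation, 0), size - 1 - span)
    return torch.arange(start, start + span + 1, dilation)


def dilated_neighborhood_attention(q, k, v, kernel_size=7, dilation=2):
    """q, k, v: (H, W, C) tensors for a single head; returns (H, W, C).

    dilation=1 recovers plain Neighborhood Attention (NA); dilation>1 gives DiNA.
    Assumes H and W are at least (kernel_size - 1) * dilation + 1.
    """
    H, W, C = q.shape
    scale = C ** -0.5
    out = torch.empty_like(q)
    for i in range(H):
        for j in range(W):
            # Dilated neighborhood around query (i, j), one axis at a time.
            rows = neighborhood_indices(i, H, kernel_size, dilation)
            cols = neighborhood_indices(j, W, kernel_size, dilation)
            keys = k[rows][:, cols].reshape(-1, C)   # (kernel_size**2, C)
            vals = v[rows][:, cols].reshape(-1, C)
            attn = F.softmax((q[i, j] @ keys.t()) * scale, dim=-1)
            out[i, j] = attn @ vals
    return out


# Toy usage: 16x16 feature map, 32 channels, 7x7 neighborhood, dilation 2.
x = torch.randn(16, 16, 32)
y = dilated_neighborhood_attention(x, x, x, kernel_size=7, dilation=2)
assert y.shape == x.shape
```

Setting the dilation to 1 recovers plain NA; per the abstract, DiNAT combines local (NA) and dilated (DiNA) layers so that receptive fields expand exponentially with depth while the per-layer cost stays that of local attention.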

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | DiNAT-Base (UperNet) | Validation mIoU | 50.4 | #63 |
| Semantic Segmentation | ADE20K | DiNAT_s-Large (UperNet) | Validation mIoU | 54.6 | #34 |
| Semantic Segmentation | ADE20K | DiNAT-Mini (UperNet) | Validation mIoU | 47.2 | #105 |
| Semantic Segmentation | ADE20K | DiNAT-Tiny (UperNet) | Validation mIoU | 48.8 | #85 |
| Semantic Segmentation | ADE20K | DiNAT-L (Mask2Former) | Validation mIoU | 58.2 | #11 |
| Semantic Segmentation | ADE20K | DiNAT-Small (UperNet) | Validation mIoU | 49.9 | #69 |
| Semantic Segmentation | ADE20K | DiNAT-Large (UperNet) | Validation mIoU | 54.6 | #34 |
| Instance Segmentation | ADE20K val | DiNAT-L (Mask2Former, single-scale) | AP | 35.2 | #3 |
| | | | APS | 16.4 | #1 |
| | | | APM | 40.4 | #1 |
| | | | APL | 54.9 | #1 |
| Semantic Segmentation | ADE20K val | DiNAT-L (Mask2Former) | mIoU | 58.2 | #8 |
| Panoptic Segmentation | ADE20K val | DiNAT-L (Mask2Former) | PQ | 48.5 | #5 |
| | | | AP | 34.4 | #5 |
| | | | mIoU | 56.2 | #5 |
| Panoptic Segmentation | Cityscapes val | DiNAT-L (Mask2Former) | PQ | 66.9 | #9 |
| | | | mIoU | 83.2 | #6 |
| | | | AP | 43.8 | #8 |
| Semantic Segmentation | Cityscapes val | DiNAT-L (Mask2Former) | mIoU | 84.5 | #6 |
| Instance Segmentation | Cityscapes val | DiNAT-L (single-scale, Mask2Former) | mask AP | 44.5 | #3 |
| | | | AP50 | 72.2 | #1 |
| Instance Segmentation | COCO minival | DiNAT-L (single-scale, Mask2Former) | mask AP | 50.7 | #13 |
| | | | AP50 | 74.8 | #2 |
| Panoptic Segmentation | COCO minival | DiNAT-L (single-scale, Mask2Former) | PQ | 58.2 | #2 |
| | | | PQth | 64.7 | #2 |
| | | | PQst | 48.4 | #4 |
| | | | AP | 49.2 | #1 |
| | | | mIoU | 68.1 | #1 |
| Image Classification | ImageNet | DiNAT-Base | Top 1 Accuracy | 84.4% | #207 |
| | | | Number of params | 90M | #682 |
| | | | GFLOPs | 13.7 | #291 |
| Image Classification | ImageNet | DiNAT-Large (384x384; Pretrained on ImageNet-22K @ 224x224) | Top 1 Accuracy | 87.18% | #70 |
| | | | GFLOPs | 89.7 | #379 |
| Image Classification | ImageNet | DiNAT-Small | Top 1 Accuracy | 83.8% | #251 |
| | | | Number of params | 51M | #578 |
| | | | GFLOPs | 7.8 | #235 |
| Image Classification | ImageNet | DiNAT-Tiny | Top 1 Accuracy | 82.7% | #341 |
| | | | Number of params | 28M | #494 |
| | | | GFLOPs | 4.3 | #180 |
| Image Classification | ImageNet | DiNAT-Mini | Top 1 Accuracy | 81.8% | #412 |
| | | | Number of params | 20M | #417 |
| | | | GFLOPs | 2.7 | #154 |
| Image Classification | ImageNet | DiNAT_s-Large (224x224; Pretrained on ImageNet-22K @ 224x224) | Top 1 Accuracy | 86.5% | #96 |
| | | | GFLOPs | 34.5 | #347 |
| Image Classification | ImageNet | DiNAT_s-Large (384x384; Pretrained on ImageNet-22K @ 224x224) | Top 1 Accuracy | 87.4% | #63 |
| | | | Number of params | 197M | #722 |
| | | | GFLOPs | 101.5 | #384 |
| Image Classification | ImageNet | DiNAT-Large (11x11 kernel; 384x384; Pretrained on ImageNet-22K @ 224x224) | Top 1 Accuracy | 87.31% | #65 |
| | | | Number of params | 200M | #725 |
| | | | GFLOPs | 92.4 | #380 |

Methods