TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic Segmentation	ADE20K	Swin-L (UperNet, ImageNet-22k pretrain)	Validation mIoU	53.50	# 74
Semantic Segmentation	ADE20K	Swin-L (UperNet, ImageNet-22k pretrain)	Test Score	62.8	# 1
Semantic Segmentation	ADE20K	Swin-B (UperNet, ImageNet-1k pretrain)	Validation mIoU	49.7	# 119
Semantic Segmentation	ADE20K val	Swin-B (UperNet, ImageNet-1k pretrain)	mIoU	49.7	# 50
Semantic Segmentation	ADE20K val	Swin-L (UperNet, ImageNet-22k pretrain)	mIoU	53.5	# 36
Object Detection	COCO minival	Swin-L (HTC++, single scale)	box AP	57.1	# 39
Instance Segmentation	COCO minival	Swin-L (HTC++, multi scale)	mask AP	50.4	# 22
Object Detection	COCO minival	Swin-L (HTC++, multi scale)	box AP	58	# 35
Instance Segmentation	COCO minival	Swin-L (HTC++, single scale)	mask AP	49.5	# 25
Instance Segmentation	COCO test-dev	Swin-L (HTC++, single scale)	mask AP	50.2	# 20
Instance Segmentation	COCO test-dev	Swin-L (HTC++, multi scale)	mask AP	51.1	# 18
Object Detection	COCO test-dev	Swin-L (HTC++, multi scale)	box mAP	58.7	# 30
Object Detection	COCO test-dev	Swin-L (HTC++, single scale)	box mAP	57.7	# 31
Semantic Segmentation	FoodSeg103	Swin-Transformer (Swin-Small)	mIoU	41.6	# 4
Image Classification	ImageNet	Swin-L	Top 1 Accuracy	87.3%	# 99
Image Classification	ImageNet	Swin-L	Number of params	197M	# 897
Image Classification	ImageNet	Swin-L	GFLOPs	103.9	# 452
Image Classification	ImageNet	Swin-B	Top 1 Accuracy	86.4%	# 143
Image Classification	ImageNet	Swin-B	Number of params	88M	# 832
Image Classification	ImageNet	Swin-B	GFLOPs	47	# 419
Image Classification	ImageNet	Swin-T	Top 1 Accuracy	81.3%	# 595
Image Classification	ImageNet	Swin-T	Number of params	29M	# 641
Image Classification	ImageNet	Swin-T	GFLOPs	4.5	# 211
Thermal Image Segmentation	MFN Dataset	SwinT	mIOU	49.0	# 34
Instance Segmentation	Occluded COCO	Swin-B + Cascade Mask R-CNN	Mean Recall	62.90	# 2
Instance Segmentation	Occluded COCO	Swin-T + Mask R-CNN	Mean Recall	58.81	# 6
Instance Segmentation	Occluded COCO	Swin-S + Mask R-CNN	Mean Recall	61.14	# 5
Image Classification	OmniBenchmark	SwinTransformer	Average Top-1 Accuracy	46.4	# 2
Instance Segmentation	Separated COCO	Swin-S + Mask R-CNN	Mean Recall	33.67	# 5
Instance Segmentation	Separated COCO	Swin-B + Cascade Mask R-CNN	Mean Recall	36.31	# 2
Instance Segmentation	Separated COCO	Swin-T + Mask R-CNN	Mean Recall	31.94	# 6

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/instance-segmentation-on-occluded-coco)](https://paperswithcode.com/sota/instance-segmentation-on-occluded-coco?p=swin-transformer-hierarchical-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/image-classification-on-omnibenchmark)](https://paperswithcode.com/sota/image-classification-on-omnibenchmark?p=swin-transformer-hierarchical-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/instance-segmentation-on-separated-coco)](https://paperswithcode.com/sota/instance-segmentation-on-separated-coco?p=swin-transformer-hierarchical-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/semantic-segmentation-on-foodseg103)](https://paperswithcode.com/sota/semantic-segmentation-on-foodseg103?p=swin-transformer-hierarchical-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/instance-segmentation-on-coco)](https://paperswithcode.com/sota/instance-segmentation-on-coco?p=swin-transformer-hierarchical-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/instance-segmentation-on-coco-minival)](https://paperswithcode.com/sota/instance-segmentation-on-coco-minival?p=swin-transformer-hierarchical-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=swin-transformer-hierarchical-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/thermal-image-segmentation-on-mfn-dataset)](https://paperswithcode.com/sota/thermal-image-segmentation-on-mfn-dataset?p=swin-transformer-hierarchical-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=swin-transformer-hierarchical-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/semantic-segmentation-on-ade20k-val)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k-val?p=swin-transformer-hierarchical-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=swin-transformer-hierarchical-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-hierarchical-vision/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=swin-transformer-hierarchical-vision)`
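
The badge entries above all follow one URL pattern, which can be sketched as a small helper. This is only an illustration inferred from the lines listed here: the function name `pwc_badge` and its parameter names are hypothetical, and the pattern is assumed to hold for other benchmark slugs.

```python
def pwc_badge(paper_slug: str, benchmark_slug: str) -> str:
    """Build badge Markdown following the pattern seen in the list above.

    Both arguments are URL slugs, e.g. "swin-transformer-hierarchical-vision"
    and "instance-segmentation-on-occluded-coco". The helper itself is
    hypothetical; it only concatenates strings.
    """
    # Image URL: a shields.io endpoint that wraps the paperswithcode badge URL.
    image = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    # Link target: the leaderboard page, with the paper passed as ?p=...
    target = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({image})]({target})"


print(pwc_badge("swin-transformer-hierarchical-vision",
                "instance-segmentation-on-occluded-coco"))
```

Called with the two slugs shown, the helper reproduces the first badge entry in the list.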

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

ICCV 2021 · Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows (the source of the name "Swin"). The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
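
The windowing idea in the abstract can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: the helper names (`window_partition`, `cyclic_shift`, `attention_pairs`) are hypothetical, the grid holds token ids rather than feature vectors, the shift is modelled as a wrap-around roll, and the attention masking that a real implementation needs for wrapped-around tokens is omitted.

```python
from typing import List, Optional

Grid = List[List[int]]


def window_partition(grid: Grid, m: int) -> List[List[int]]:
    """Split an H x W grid into non-overlapping m x m windows (row-major)."""
    h, w = len(grid), len(grid[0])
    assert h % m == 0 and w % m == 0, "grid must tile evenly into windows"
    windows = []
    for top in range(0, h, m):
        for left in range(0, w, m):
            windows.append([grid[r][c]
                            for r in range(top, top + m)
                            for c in range(left, left + m)])
    return windows


def cyclic_shift(grid: Grid, s: int) -> Grid:
    """Roll the grid up and to the left by s positions, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + s) % h][(c + s) % w] for c in range(w)]
            for r in range(h)]


def attention_pairs(h: int, w: int, m: Optional[int] = None) -> int:
    """Count query-key pairs: global attention if m is None, else windowed."""
    n = h * w
    # Global: every token attends to all n tokens.
    # Windowed: every token attends only to the m*m tokens in its window.
    return n * n if m is None else n * m * m


# Toy 4 x 4 grid of token ids 0..15, window size 2, shift of half a window.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
print(window_partition(grid, 2))                   # regular windows
print(window_partition(cyclic_shift(grid, 1), 2))  # shifted windows
print(attention_pairs(4, 4), attention_pairs(4, 4, 2))
print(attention_pairs(8, 8), attention_pairs(8, 8, 2))
```

In this toy case the first shifted window holds tokens 5, 6, 9 and 10, one from each of the four regular windows, which is the cross-window connection the abstract refers to. Doubling the grid side from 4 to 8 multiplies the windowed pair count by 4 (linear in the number of tokens) but the global count by 16.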

Code

Repository	Stars
microsoft/Swin-Transformer (official)	12,925
huggingface/transformers	124,793
rwightman/pytorch-image-models	29,713
open-mmlab/mmdetection	27,744
pytorch/vision	15,422

See all 69 implementations

Tasks

Image Classification

Instance Segmentation

Object Detection

Real-Time Object Detection

Semantic Segmentation

Thermal Image Segmentation

Datasets

ImageNet

MS COCO

ADE20K

MFNet

OmniBenchmark

FoodSeg103

Separated COCO

Occluded COCO

Results from the Paper

Ranked #2 on Image Classification on OmniBenchmark


Methods

Absolute Position Encodings • AdamW • BPE • Cosine Annealing • Dense Connections • Dropout • GELU • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Stochastic Depth • Swin Transformer • Transformer
