TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic Segmentation	ADE20K	SwinV2-G (UperNet)	Validation mIoU	59.9	# 12
Semantic Segmentation	ADE20K	SwinV2-G-HTC++ Liu et al. ([2021a])	Validation mIoU	53.7	# 68
Instance Segmentation	COCO minival	SwinV2-G (HTC++)	mask AP	53.7	# 7
Object Detection	COCO minival	SwinV2-G (HTC++)	box AP	62.5	# 13
Object Detection	COCO test-dev	SwinV2-G (HTC++)	box mAP	63.1	# 16
Object Detection	COCO test-dev	SwinV2-G (HTC++)	Params (M)	3000	# 1
Instance Segmentation	COCO test-dev	SwinV2-G (HTC++)	mask AP	54.4	# 8
Image Classification	ImageNet	SwinV2-G	Top 1 Accuracy	90.17%	# 15
Image Classification	ImageNet	SwinV2-G	Number of params	3000M	# 973
Image Classification	ImageNet	SwinV2-B	Top 1 Accuracy	87.1%	# 103
Image Classification	ImageNet	SwinV2-B	Number of params	88M	# 832
Image Classification	ImageNet V2	SwinV2-G	Top 1 Accuracy	84.00%	# 4
Image Classification	ImageNet V2	SwinV2-B	Top 1 Accuracy	78.08%	# 13
Action Classification	Kinetics-400	Video-SwinV2-G (ImageNet-22k and external 70M pretrain)	Acc@1	86.8	# 38

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-v2-scaling-up-capacity-and/image-classification-on-imagenet-v2)](https://paperswithcode.com/sota/image-classification-on-imagenet-v2?p=swin-transformer-v2-scaling-up-capacity-and)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-v2-scaling-up-capacity-and/instance-segmentation-on-coco-minival)](https://paperswithcode.com/sota/instance-segmentation-on-coco-minival?p=swin-transformer-v2-scaling-up-capacity-and)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-v2-scaling-up-capacity-and/instance-segmentation-on-coco)](https://paperswithcode.com/sota/instance-segmentation-on-coco?p=swin-transformer-v2-scaling-up-capacity-and)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-v2-scaling-up-capacity-and/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=swin-transformer-v2-scaling-up-capacity-and)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-v2-scaling-up-capacity-and/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=swin-transformer-v2-scaling-up-capacity-and)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-v2-scaling-up-capacity-and/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=swin-transformer-v2-scaling-up-capacity-and)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-v2-scaling-up-capacity-and/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=swin-transformer-v2-scaling-up-capacity-and)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swin-transformer-v2-scaling-up-capacity-and/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=swin-transformer-v2-scaling-up-capacity-and)`

Swin Transformer V2: Scaling Up Capacity and Resolution

CVPR 2022 · Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo

Large-scale NLP models have been shown to significantly improve performance on language tasks with no signs of saturation. They also demonstrate remarkable few-shot capabilities comparable to those of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in the training and application of large vision models: training instability, resolution gaps between pre-training and fine-tuning, and hunger for labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained on low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast amounts of labelled images. Through these techniques, this paper successfully trained a 3-billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and made it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks: ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Our training is also much more efficient than that of Google's billion-level visual models, consuming 40 times less labelled data and 40 times less training time. Code is available at https://github.com/microsoft/Swin-Transformer.
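
The first two techniques named in the abstract can be illustrated with a minimal sketch. It assumes the formulation given in the paper: an attention logit computed as the cosine similarity of query and key divided by a temperature that is kept above 0.01, plus a position bias, and relative offsets mapped to log space as sign(d) * log(1 + |d|). The function names `log_spaced_coord` and `cosine_attention_logit`, the default `tau=0.1`, and the `eps` guard are hypothetical choices made for illustration; they are not taken from the microsoft/Swin-Transformer code, and in the model the temperature is learned rather than fixed.

```python
import math


def log_spaced_coord(delta):
    """Map a signed relative offset to log space: sign(d) * log(1 + |d|)."""
    return math.copysign(math.log1p(abs(delta)), delta)


def cosine_attention_logit(q, k, tau=0.1, bias=0.0, eps=1e-12):
    """Cosine similarity of q and k divided by a temperature, plus a bias.

    tau is a fixed illustrative value here (an assumption of this sketch).
    """
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    tau = max(tau, 0.01)  # keep the temperature above 0.01
    # eps only guards against division by zero for all-zero vectors
    return dot / (max(norm_q * norm_k, eps) * tau) + bias


# Moving from an 8x8 to a 16x16 window widens offsets from [-7, 7] to [-15, 15].
linear_growth = 15 / 7                                     # range grows about 2.14x
log_growth = log_spaced_coord(15) / log_spaced_coord(7)    # range grows about 1.33x

# Cosine similarity ignores vector magnitude, so rescaling q leaves the logit
# essentially unchanged (up to floating-point rounding).
small = cosine_attention_logit([1.0, 2.0], [2.0, 1.0])
large = cosine_attention_logit([100.0, 200.0], [2.0, 1.0])
```

In this sketch the log-spaced range grows far less than the linear range when the window size doubles, which is the intuition behind transferring position biases across resolutions. The magnitude invariance of the cosine logit is the property associated with more stable training of large models.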

Code

microsoft/Swin-Transformer (official)	12,955
rwightman/pytorch-image-models	29,758
PaddlePaddle/PaddleDetection	12,059
towhee-io/towhee	2,991
leondgarse/keras_cv_attention_models	556

See all 19 implementations

Tasks

Action Classification

Image Classification

Instance Segmentation

Object Detection

Semantic Segmentation

Datasets

ImageNet

MS COCO

Kinetics

ADE20K

Kinetics 400

Objects365

Results from the Paper

Ranked #4 on Image Classification on ImageNet V2 (using extra training data)

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Stochastic Depth • Swin Transformer • Transformer
