CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

CVPR 2022 · Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo

We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism, which computes self-attention in horizontal and vertical stripes in parallel; together the stripes form a cross-shaped window, and each stripe is obtained by splitting the input feature into stripes of equal width. We provide a mathematical analysis of the effect of the stripe width and vary the stripe width across the layers of the Transformer network, which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions and is thus especially effective and friendly for downstream tasks. Incorporating these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or labels, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 52.2 mIoU on the ADE20K semantic segmentation task, surpassing the previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under a similar FLOPs setting. By further pretraining on the larger ImageNet-21K dataset, we achieve 87.5% Top-1 accuracy on ImageNet-1K and strong segmentation performance on ADE20K with 55.7 mIoU. The code and models are available at https://github.com/microsoft/CSWin-Transformer.
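To make the cross-shaped window concrete, here is a minimal sketch of which positions a single token can attend to. It is not the authors' implementation: the function names, the `(row, col)` grid indexing, and the `sw` parameter for stripe width are assumptions made for illustration, and the sketch only enumerates the attention region rather than computing attention itself.

```python
def horizontal_stripe(i, j, height, width, sw):
    """Positions in the horizontal stripe (sw rows, full width) containing token (i, j)."""
    top = (i // sw) * sw  # first row of the stripe that row i falls into
    return {(r, c) for r in range(top, min(top + sw, height)) for c in range(width)}


def vertical_stripe(i, j, height, width, sw):
    """Positions in the vertical stripe (full height, sw columns) containing token (i, j)."""
    left = (j // sw) * sw  # first column of the stripe that column j falls into
    return {(r, c) for r in range(height) for c in range(left, min(left + sw, width))}


def cross_window(i, j, height, width, sw):
    """Union of both stripes: the cross-shaped region visible to token (i, j)."""
    return horizontal_stripe(i, j, height, width, sw) | vertical_stripe(i, j, height, width, sw)


if __name__ == "__main__":
    H = W = 8  # hypothetical 8x8 feature map
    for sw in (1, 2, 4, 8):
        region = cross_window(3, 5, H, W, sw)
        print(f"stripe width {sw}: {len(region)} of {H * W} positions visible")
```

When the stripe width divides both sides of the feature map, the region holds `sw * (H + W) - sw * sw` positions, so a wider stripe enlarges the area each token sees at a higher attention cost. With `sw` equal to the side of a square map, the region covers every position.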

Code

microsoft/CSWin-Transformer (official) · 518 stars

PaddlePaddle/PaddleClas · 5,272 stars

BR-IDL/PaddleViT · 1,187 stars

TJUdyk/CSWin-Transformer

fogfog2/packnet

6 implementations are listed in total.

Tasks

Image Classification

Semantic Segmentation

Datasets

ImageNet

MS COCO

ADE20K

Results from the Paper

Ranked #25 on Semantic Segmentation on ADE20K val

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Semantic Segmentation	ADE20K	CSWin-L (UperNet, ImageNet-22k pretrain)	Validation mIoU	55.70	# 39
Semantic Segmentation	ADE20K val	CSWin-L (UperNet, ImageNet-22k pretrain)	mIoU	55.7	# 25
Image Classification	ImageNet	CSWin-L (384 res, ImageNet-22k pretrain)	Top 1 Accuracy	87.5%	# 86
Image Classification	ImageNet	CSWin-L (384 res, ImageNet-22k pretrain)	Number of params	173M	# 884
Image Classification	ImageNet	CSWin-L (384 res, ImageNet-22k pretrain)	GFLOPs	96.8	# 445

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Stochastic Depth • Swin Transformer • Transformer
