Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

25 Mar 2021 · Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text...
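The shifted-window scheme named in the title can be summarized in a few lines. The sketch below is an illustrative reconstruction, not the authors' code: the helper names (window_partition, window_reverse, shifted_window_attention), the use of torch.nn.MultiheadAttention, and the toy sizes are assumptions, and the attention mask that the paper applies to the wrapped-around regions after the cyclic shift is omitted for brevity.

```python
# Minimal sketch of window-based self-attention with a shifted-window step.
# Assumes a (B, H, W, C) feature map whose side lengths are divisible by the
# window size; not the official Swin Transformer implementation.
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) map into non-overlapping windows:
    (B * num_windows, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition: back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.reshape(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

def shifted_window_attention(x, attn, window_size, shift_size):
    """Apply an attention module `attn` (e.g. torch.nn.MultiheadAttention with
    batch_first=True) inside each local window. A non-zero shift_size cyclically
    rolls the feature map so that this layer's windows straddle the previous
    layer's window boundaries, providing cross-window connections."""
    B, H, W, C = x.shape
    if shift_size > 0:
        x = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
    windows = window_partition(x, window_size)      # (B * nW, ws * ws, C)
    out, _ = attn(windows, windows, windows)         # attention within each window only
    x = window_reverse(out, window_size, H, W)
    if shift_size > 0:
        x = torch.roll(x, shifts=(shift_size, shift_size), dims=(1, 2))
    return x

# Toy usage: 56x56 feature map, 7x7 windows, one regular and one shifted layer.
attn = torch.nn.MultiheadAttention(embed_dim=96, num_heads=3, batch_first=True)
feat = torch.randn(2, 56, 56, 96)
feat = shifted_window_attention(feat, attn, window_size=7, shift_size=0)
feat = shifted_window_attention(feat, attn, window_size=7, shift_size=3)
```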


Results from the Paper


TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK
Semantic Segmentation | ADE20K | Swin-L (UperNet, ImageNet-22k pretrain) | Validation mIoU | 53.50 | # 1
Semantic Segmentation | ADE20K | Swin-L (UperNet, ImageNet-22k pretrain) | Test Score | 62.8 | # 1
Semantic Segmentation | ADE20K val | Swin-L (UperNet, ImageNet-22k pretrain) | mIoU | 53.5 | # 1
Object Detection | COCO minival | Swin-L (HTC++, multi scale) | box AP | 58.0 | # 1
Object Detection | COCO minival | Swin-L (HTC++) | box AP | 57.1 | # 2
Instance Segmentation | COCO minival | Swin-L (HTC++) | mask AP | 49.5 | # 2
Instance Segmentation | COCO minival | Swin-L (HTC++, multi scale) | mask AP | 50.4 | # 1
Instance Segmentation | COCO test-dev | Swin-L (HTC++, single scale) | mask AP | 50.2 | # 2
Object Detection | COCO test-dev | Swin-L (HTC++, multi scale) | box AP | 58.7 | # 1
Instance Segmentation | COCO test-dev | Swin-L (HTC++, multi scale) | mask AP | 51.1 | # 1
Object Detection | COCO test-dev | Swin-L (HTC++, single scale) | box AP | 57.7 | # 2
Image Classification | ImageNet | Swin-L (384 res, ImageNet-22k pretrain) | Top 1 Accuracy | 86.4% | # 18
Image Classification | ImageNet | Swin-B (384 res, ImageNet-22k pretrain) | Top 1 Accuracy | 86.0% | # 27
Image Classification | ImageNet | Swin-B (384x384 res) | Top 1 Accuracy | 84.2% | # 68

Methods used in the Paper


METHOD | TYPE
GELU | Activation Functions
BPE | Subword Segmentation
Adam | Stochastic Optimization
Dense Connections | Feedforward Networks
Softmax | Output Functions
Dropout | Regularization
Residual Connection | Skip Connections
Layer Normalization | Normalization
Label Smoothing | Regularization
Multi-Head Attention | Attention Modules
Scaled Dot-Product Attention | Attention Mechanisms
Transformer | Transformers
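To make the relationship between several of the listed components concrete, here is a minimal, self-contained sketch of how Layer Normalization, Multi-Head (scaled dot-product) Attention, GELU, Dropout, Dense (feedforward) layers, and Residual Connections compose into one pre-norm Transformer block of the kind Swin applies within each local window. It is an illustrative assumption, not the official implementation; the class name WindowTransformerBlock and all hyperparameters are made up for the example.

```python
# Sketch of a pre-norm Transformer block combining components from the table
# above; hypothetical names and sizes, not the official Swin code.
import torch
from torch import nn

class WindowTransformerBlock(nn.Module):
    def __init__(self, dim=96, num_heads=3, mlp_ratio=4, drop=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                      # Layer Normalization
        self.attn = nn.MultiheadAttention(dim, num_heads,
                                          dropout=drop,
                                          batch_first=True)  # Multi-Head Attention
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                           # feedforward network
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),                                      # GELU activation
            nn.Dropout(drop),                               # Dropout regularization
            nn.Linear(mlp_ratio * dim, dim),
            nn.Dropout(drop),
        )

    def forward(self, x):
        # x: (batch * num_windows, tokens_per_window, dim)
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)
        x = x + h                                           # residual (skip) connection
        x = x + self.mlp(self.norm2(x))                     # second residual connection
        return x

# Toy usage: 7x7 = 49 tokens per window, embedding dimension 96.
block = WindowTransformerBlock()
tokens = torch.randn(8, 49, 96)
print(block(tokens).shape)  # torch.Size([8, 49, 96])
```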