Swin Transformer V2: Scaling Up Capacity and Resolution

We present techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to 1,536$\times$1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1/54.4 box/mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification... Our techniques are generally applicable for scaling up vision models, which has not been widely explored as that of NLP language models, partly due to the following difficulties in training and applications: 1) vision models often face instability issues at scale and 2) many downstream vision tasks require high resolution images or windows and it is not clear how to effectively transfer models pre-trained at low resolutions to higher resolution ones. The GPU memory consumption is also a problem when the image resolution is high. To address these issues, we present several techniques, which are illustrated by using Swin Transformer as a case study: 1) a post normalization technique and a scaled cosine attention approach to improve the stability of large vision models; 2) a log-spaced continuous position bias technique to effectively transfer models pre-trained at low-resolution images and windows to their higher-resolution counterparts. In addition, we share our crucial implementation details that lead to significant savings of GPU memory consumption and thus make it feasible to train large vision models with regular GPUs. Using these techniques and self-supervised pre-training, we successfully train a strong 3B Swin Transformer model and effectively transfer it to various vision tasks involving high-resolution images or windows, achieving the state-of-the-art accuracy on a variety of benchmarks. read more

PDF Abstract

Results from the Paper

 Ranked #1 on Action Classification on Kinetics-400 (using extra training data)

     Get a GitHub badge
Task Dataset Model Metric Name Metric Value Global Rank Uses Extra
Training Data
Result Benchmark
Semantic Segmentation ADE20K SwinV2-G (UperNet, multiview) Validation mIoU 59.9 # 1
Semantic Segmentation ADE20K SwinV2-G (UperNet) Validation mIoU 59.3 # 2
Instance Segmentation COCO test-dev SwinV2-G (HTC++) mask AP 54.4 # 1
Object Detection COCO test-dev SwinV2-G (HTC++) box AP 63.1 # 1
Image Classification ImageNet SwinV2-G Top 1 Accuracy 90.17% # 6
Image Classification ImageNet V2 SwinV2-G Top 1 Accuracy 84.00 # 1
Action Classification Kinetics-400 Video-SwinV2-G (IN-22K-ext-70M) Vid acc@1 86.8 # 1