This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
We build a family of models that surpass existing MLPs and achieve accuracy (83.2%) on ImageNet-1K classification comparable to state-of-the-art Transformers such as Swin Transformer (83.3%), while using fewer parameters and FLOPs.