Recently, zero-shot image classification via vision-language pre-training has demonstrated remarkable results: a model can classify images of an arbitrary category without seeing any additional annotated images of that category.
Our techniques are generally applicable for scaling up vision models, which, unlike scaling in NLP language models, has not been widely explored, partly due to the following difficulties in training and application: 1) vision models often face instability issues at scale, and 2) many downstream vision tasks require high-resolution images or windows, and it is not clear how to effectively transfer models pre-trained at low resolution to higher resolutions.
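One common way to transfer a model pre-trained at low resolution to a higher one is to interpolate its positional-embedding grid to the new spatial size. The following numpy sketch illustrates that idea with simple bilinear interpolation; the function name, grid sizes, and shapes are illustrative assumptions, not the authors' actual method.

```python
import numpy as np

def resize_pos_embed(pos, new_hw):
    """Bilinearly interpolate an (H, W, C) positional-embedding grid to new_hw.
    Illustrative sketch; real implementations often use bicubic interpolation."""
    H, W, C = pos.shape
    nh, nw = new_hw
    # sample coordinates in the source grid (corners aligned)
    ys = np.linspace(0, H - 1, nh)
    xs = np.linspace(0, W - 1, nw)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = pos[y0][:, x0] * (1 - wx) + pos[y0][:, x1] * wx
    bot = pos[y1][:, x0] * (1 - wx) + pos[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

low = np.random.rand(7, 7, 32)           # embeddings pre-trained on a 7x7 grid
high = resize_pos_embed(low, (14, 14))   # transferred to a 14x14 grid
```

Because the corner coordinates align, the four corner embeddings are preserved exactly while interior positions are interpolated.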
Ranked #1 on Object Detection on COCO test-dev (using extra training data)
We also leverage this approach to facilitate the training of a 3B-parameter model (SwinV2-G), which, using $40\times$ less data than in previous practice, achieves state-of-the-art results on four representative vision benchmarks.
We introduce MixTraining, a new training paradigm for object detection that can improve the performance of existing detectors for free.
We are witnessing a modeling shift from CNNs to Transformers in computer vision.
Ranked #35 on Semantic Segmentation on ADE20K
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
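The core idea behind this backbone, self-attention restricted to non-overlapping local windows, alternated with a cyclic shift so that information flows across window boundaries, can be sketched in a few lines of numpy; the shapes and names below are illustrative, not the paper's implementation.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into non-overlapping (M, M, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    # reorder so each window's pixels are contiguous: (num_windows, M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M, M, C)

x = np.arange(8 * 8 * 1, dtype=float).reshape(8, 8, 1)
windows = window_partition(x, 4)                    # 4 windows of 4x4
# cyclic shift used by the alternating "shifted window" layers
shifted = np.roll(x, shift=(-2, -2), axis=(0, 1))
```

Attention would then be computed independently within each window, which keeps the cost linear in image size rather than quadratic.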
Ranked #3 on Semantic Segmentation on FoodSeg103 (using extra training data)
We argue that the power of contrastive learning has yet to be fully unleashed, as current methods are trained only on instance-level pretext tasks, leading to representations that may be sub-optimal for downstream tasks requiring dense pixel predictions.
This paper presents parametric instance classification (PIC) for unsupervised visual feature learning.
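The instance-classification idea treats every training image as its own class and learns a parametric classifier over all instances. Below is a toy numpy sketch assuming a cosine-softmax classifier with one weight vector per image; the temperature, sizes, and names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 8                        # N images, each treated as its own "class"
feats = rng.normal(size=(N, D))    # encoder features, one per image
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
W = rng.normal(size=(N, D))        # one classifier weight per instance
W /= np.linalg.norm(W, axis=1, keepdims=True)

tau = 0.2                          # temperature (illustrative value)
logits = feats @ W.T / tau         # cosine-similarity logits
labels = np.arange(N)              # image i belongs to instance class i
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(N), labels].mean()  # standard cross-entropy
```

Training then reduces to ordinary softmax cross-entropy, which is the appeal of the parametric formulation over pairwise contrastive objectives.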
This paper introduces a negative margin loss into metric-learning-based few-shot learning methods.
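A margin softmax loss shifts the true-class logit by a margin before the softmax; making that margin negative relaxes the usual large-margin penalty instead of tightening it. The numpy sketch below illustrates the mechanics under that assumption; the temperature, values, and function name are illustrative, and the paper's exact formulation may differ.

```python
import numpy as np

def margin_softmax_loss(cos_sim, labels, margin, tau=0.1):
    """Softmax cross-entropy over cosine similarities, with a margin applied
    to the true-class logit. margin < 0 gives the *negative* margin variant."""
    n = len(labels)
    logits = cos_sim.copy()
    logits[np.arange(n), labels] -= margin   # shrink (or grow) the true logit
    logits /= tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), labels].mean()

cos = np.array([[0.8, 0.3],
                [0.2, 0.9]])
y = np.array([0, 1])
hard = margin_softmax_loss(cos, y, margin=+0.2)   # standard large margin
soft = margin_softmax_loss(cos, y, margin=-0.2)   # negative margin
```

With a negative margin the true-class logit is boosted rather than penalized, so the loss on the same predictions is strictly smaller than with a positive margin.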