Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.
CoMatch jointly learns two representations of the training data: class probabilities and low-dimensional embeddings.
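One way the two representations can interact is graph-based smoothing: each sample's class probabilities are blended with those of its nearest neighbors in embedding space. The sketch below is illustrative only, not the CoMatch implementation; the function name, the temperature `tau`, and the mixing weight `alpha` are all assumptions.

```python
import numpy as np

def refine_pseudo_labels(probs, emb, tau=0.1, alpha=0.9):
    """Illustrative CoMatch-style smoothing: blend each sample's class
    probabilities with a similarity-weighted average of its neighbors'
    probabilities. alpha and tau are assumed hyperparameters."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = np.exp(emb @ emb.T / tau)      # pairwise embedding similarity
    np.fill_diagonal(sim, 0.0)           # exclude self-similarity
    w = sim / sim.sum(axis=1, keepdims=True)
    return alpha * probs + (1 - alpha) * w @ probs

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
refined = refine_pseudo_labels(probs, emb)
```

Because the output is a convex combination of valid distributions, each row of `refined` still sums to one.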
We propose momentum prototypes (MoPro), a simple contrastive learning method that achieves online label noise correction, out-of-distribution sample removal, and representation learning.
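The core of a momentum prototype is an exponential-moving-average update of each class centroid with incoming normalized embeddings. The following is a minimal sketch under assumed conventions (the momentum coefficient and normalization scheme are assumptions, not the authors' code):

```python
import numpy as np

def normalize(x):
    # L2-normalize along the last axis
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def momentum_prototype_update(proto, z, m=0.999):
    """EMA update of a class prototype with a new normalized
    embedding z (illustrative sketch; m is an assumed coefficient)."""
    proto = m * proto + (1 - m) * z
    return proto / np.linalg.norm(proto)  # keep prototype on unit sphere

rng = np.random.default_rng(0)
proto = normalize(rng.normal(size=4))      # current class prototype
z = normalize(rng.normal(size=4))          # embedding of a new sample
new_proto = momentum_prototype_update(proto, z)
```

Keeping prototypes on the unit sphere makes cosine similarity between an embedding and each prototype a natural score for label-noise correction and out-of-distribution detection.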
Specifically, we systematically investigate the performance drop of the state-of-the-art two-stage instance segmentation model Mask R-CNN on the recent long-tail LVIS dataset, and reveal that a major cause is the inaccurate classification of object proposals.
This paper presents Prototypical Contrastive Learning (PCL), an unsupervised representation learning method that addresses the fundamental limitations of instance-wise contrastive learning.
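A prototypical contrastive objective contrasts each embedding against cluster prototypes rather than against every other instance. Below is a hedged ProtoNCE-style sketch; the function name, the temperature `tau`, and the toy nearest-prototype assignment are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def normalize(x):
    # L2-normalize along the last axis
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def proto_nce(z, protos, labels, tau=0.1):
    """ProtoNCE-style loss (illustrative sketch; tau is an assumed
    temperature). Each embedding is pulled toward its assigned
    prototype and pushed away from the others."""
    logits = z @ protos.T / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(z)), labels].mean()

rng = np.random.default_rng(0)
z = normalize(rng.normal(size=(8, 4)))       # toy embeddings
protos = normalize(rng.normal(size=(3, 4)))  # toy cluster prototypes
labels = (z @ protos.T).argmax(axis=1)       # nearest-prototype assignment
loss = proto_nce(z, protos, labels)
```

In practice the prototypes would come from clustering the embeddings (e.g. k-means) rather than from random initialization as in this toy example.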
We address the challenging problem of training object detectors with noisy annotations, where the noise contains a mixture of label noise and bounding box noise.
Two prominent directions include learning with noisy labels and semi-supervised learning by exploiting unlabeled data.
In this report, we investigate the performance drop of state-of-the-art two-stage instance segmentation models on the extreme long-tail training data of the LVIS dataset, and find that a major cause is the inaccurate classification of object proposals.
Despite the success of deep neural networks (DNNs) in image classification, human-level performance relies on massive training data with high-quality manual annotations, which are expensive and time-consuming to collect.
Unlike previous work in video representation learning, our unsupervised learning task is to predict 3D motion in multiple target views using the video representation from a source view.
The recent advances in instance-level detection tasks lay a strong foundation for genuine comprehension of visual scenes.
However, due to the domain shift problem, the performance of deep classifiers trained on Web images tends to degrade when they are directly deployed to videos.