Recent work by Bello shows that training and scaling strategies may matter more than model architecture for visual recognition.
Ranked #13 on Action Classification on Kinetics-600
We further take the server-client and inter-client domain shifts into account and pose a domain adaptation problem with one source (centralized server data) and multiple targets (distributed client data).
In this paper, we explore the use of Single Image Texture Translation (SITT) for data augmentation.
This often leads to incorrect results, such as the absence of a high-confidence detection on the object of interest, or a detection with the wrong class label.
On COCO, ViLD outperforms the previous SOTA by 4.8 on novel AP and 11.4 on overall AP.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
Ranked #1 on Action Classification on Moments in Time (using extra training data)
Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories.
Ranked #1 on Object Detection on LVIS v1.0
Furthermore, SpineNet is built with a uniform resource distribution over operations.
Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away.
Ranked #1 on Self-Supervised Action Recognition on Kinetics-600
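The pull-together / push-apart objective described above is the standard InfoNCE-style contrastive loss. A minimal NumPy sketch follows; the batch layout, temperature value, and function name are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of paired embeddings.

    z_a, z_b: (N, D) arrays; row i of each is an embedding of an augmented
    clip from the same short video (a positive pair), while clips in all
    other rows act as negatives. Hypothetical sketch, not the paper's code.
    """
    # L2-normalize so the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    # Subtract the row max for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    # Diagonal entries are the positives; cross-entropy pulls them together
    # in embedding space and pushes the other-video clips away.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

As a sanity check, the loss is low when the two views agree and high when the positive pairs are misaligned.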
For example, on the COCO object detection dataset, pre-training benefits when we use one fifth of the labeled data, and hurts accuracy when we use all labeled data.
Ranked #1 on Semantic Segmentation on PASCAL VOC 2012 test (using extra training data)
In this work we explore the task of instance segmentation with attribute localization, which unifies instance segmentation (detect and segment each object instance) and fine-grained visual attribute categorization (recognize one or multiple attributes).
We also investigate the interplay between dataset granularity and a variety of other factors, and find that fine-grained datasets are more difficult to learn from, more difficult to transfer to, more difficult to perform few-shot learning with, and more vulnerable to adversarial attacks.
We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.
Ranked #4 on Image Classification on iNaturalist
The dataset is constructed from over one million fashion images with a label space that includes 8 groups of 228 fine-grained attributes in total.
We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss.
Ranked #2 on Long-tail Learning on EGTEA
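The effective number of samples is commonly written as E_n = (1 − β^n) / (1 − β), with the class weight proportional to its inverse; a small sketch under that assumed form (the β value and normalization are illustrative hyperparameter choices, not taken from the source):

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class loss weights from the effective number of samples.

    Assumed form: E_n = (1 - beta**n) / (1 - beta), weight ∝ 1 / E_n.
    As n grows, E_n saturates, so frequent classes are down-weighted
    and rare classes are up-weighted, re-balancing the loss.
    """
    n = np.asarray(samples_per_class, dtype=float)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of classes.
    return weights * len(n) / weights.sum()
```

Multiplying each example's loss by the weight of its class yields the class-balanced loss described above.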
To address these two challenges, we propose a novel learning based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions.
We propose a measure to estimate domain similarity via Earth Mover's Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure.
Ranked #13 on Fine-Grained Image Classification on CUB-200-2011
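One way to sketch the idea: compute a 1-D Earth Mover's Distance per feature dimension between source and target, average, and map the distance through exp(−γ·d) to get a similarity score. The feature choice, per-dimension decomposition, and γ below are illustrative assumptions, not the paper's published recipe:

```python
import numpy as np

def emd_1d(a, b):
    """EMD between two equal-size, equal-weight 1-D samples.

    For equal-size samples this reduces to the mean absolute difference
    of the sorted values.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    return np.mean(np.abs(a - b))

def domain_similarity(src_feats, tgt_feats, gamma=0.01):
    """Hypothetical similarity score between two feature domains.

    Averages per-dimension EMD over feature columns and maps the result
    through exp(-gamma * d), so closer distributions score nearer to 1.
    """
    d = np.mean([emd_1d(src_feats[:, j], tgt_feats[:, j])
                 for j in range(src_feats.shape[1])])
    return np.exp(-gamma * d)
```

Under this sketch, a source domain scoring higher similarity to the target would be the preferred pre-training source.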
Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories.
Ranked #3 on Image Classification on iNaturalist
We demonstrate how to approximate kernels such as the Gaussian RBF up to a given order using compact explicit feature maps in a parameter-free manner.
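A standard construction of such an explicit map is the truncated Taylor expansion exp(−γ‖x−y‖²) = exp(−γ‖x‖²) exp(−γ‖y‖²) Σ_k (2γ)^k ⟨x,y⟩^k / k!, whose order-k term factors through the k-fold tensor product of the input. The sketch below implements that generic construction, not necessarily the paper's exact compact map:

```python
import itertools
import math
import numpy as np

def taylor_rbf_features(X, gamma=0.5, order=7):
    """Explicit feature map approximating the Gaussian RBF kernel up to
    a given Taylor order (generic construction; a sketch, not the
    paper's exact compact map).

    For each order k, emits sqrt((2*gamma)**k / k!) times every degree-k
    monomial of x, so that Phi @ Phi.T recovers the truncated kernel.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    feats = []
    for k in range(order + 1):
        coef = math.sqrt((2.0 * gamma) ** k / math.factorial(k))
        for idx in itertools.product(range(d), repeat=k):
            # Product over the chosen coordinates; empty product is 1.
            feats.append(coef * np.prod(X[:, list(idx)], axis=1))
    Phi = np.stack(feats, axis=1)
    # Fold in the exp(-gamma * ||x||^2) prefactor of each point.
    return Phi * np.exp(-gamma * (X ** 2).sum(axis=1, keepdims=True))
```

The feature dimension grows as d^k per order, so in practice the compact maps the abstract refers to would prune or compress these monomials; this sketch keeps them all for clarity.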
Metric learning algorithms produce distance metrics that capture the important relationships among data.
Ranked #1 on Recommendation Systems on Million Song Dataset (Recall@100 metric)
To demonstrate the effectiveness of the proposed framework, we bootstrap a fine-grained flower dataset with 620 categories from Instagram images.
Most approaches predict the location of a query image by matching it to ground-level images with known locations (e.g., street-view data).
In this paper, we propose to build Concept Bank, the largest concept library, consisting of 4,876 concepts specifically designed to cover 631 real-world events.