We propose NorCal, Normalized Calibration for long-tailed object detection and instance segmentation, a simple recipe that reweights the predicted scores of each class by its training sample size.
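The calibration step can be sketched in a few lines. Below is a minimal illustration of frequency-based score reweighting, assuming per-class scores and training counts are available; the exponent `gamma` and the final renormalization are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def norcal_sketch(scores, class_counts, gamma=0.5):
    """Reweight per-class scores by training sample size (a sketch).

    Each class score is divided by its training count raised to `gamma`,
    then scores are renormalized, so rare classes are boosted relative
    to frequent ones. `gamma` and the renormalization are assumptions.
    """
    scores = np.asarray(scores, dtype=np.float64)        # shape: (num_classes,)
    counts = np.asarray(class_counts, dtype=np.float64)  # shape: (num_classes,)
    calibrated = scores / np.power(counts, gamma)        # down-weight frequent classes
    return calibrated / calibrated.sum()                 # renormalize to a distribution

# Example: a rare class (10 samples) gains score mass over a frequent one (10,000).
print(norcal_sketch([0.6, 0.4], [10_000, 10]))
```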
Many objects do not appear frequently enough in complex scenes (e.g., certain handbags in living rooms) for training an accurate object detector, but are often found frequently by themselves (e.g., in product images).
Identifying a short segment in a long video that semantically matches a text query is a challenging task with important potential applications in language-based video search, browsing, and navigation.
Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions.
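A common way to learn such a space is a hinge-based triplet objective over in-batch negatives. The sketch below is one such objective in the style of VSE-type models; the margin, cosine similarity, and sum over negatives are illustrative assumptions, not a specific paper's formulation.

```python
import torch
import torch.nn.functional as F

def vse_loss(img_emb, txt_emb, margin=0.2):
    """Pull matched image-text pairs together, push mismatched pairs apart."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                       # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # similarities of matched pairs
    cost_txt = (margin + sim - pos).clamp(min=0)      # image -> text negatives
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # text -> image negatives
    mask = torch.eye(sim.size(0), dtype=torch.bool)   # ignore the positive diagonal
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

loss = vse_loss(torch.randn(8, 128), torch.randn(8, 128))  # toy batch of 8 pairs
```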
Recent progress has leveraged the ideas of pre-training (from language modeling) and attention layers in Transformers to learn representations from datasets containing images aligned with linguistic expressions that describe them.
Collectively, the POLL problem setting, the Firehose datasets, and the ConGraD algorithm enable a complete benchmark for reproducible research on web-scale continual learning.
To this end, we propose BabyWalk, a new VLN agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially.
To narrate a sequence of images, we use the predicted anchor word embeddings and the image features as the joint input to a seq2seq model.
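The joint input can be formed by simple concatenation. The sketch below shows one way to fuse per-image anchor word embeddings with image features before a seq2seq encoder; the dimensions and fusion-by-concatenation choice are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

embed_dim, img_dim, hidden = 300, 2048, 512
num_images = 5  # one anchor word embedding and one feature vector per image

anchor_embs = torch.randn(num_images, embed_dim)  # predicted anchor word embeddings
img_feats = torch.randn(num_images, img_dim)      # CNN features, one per image

# Concatenate the two modalities per step, then encode the sequence.
joint = torch.cat([anchor_embs, img_feats], dim=-1)          # (5, 2348)
encoder = nn.GRU(embed_dim + img_dim, hidden, batch_first=True)
enc_out, enc_h = encoder(joint.unsqueeze(0))                 # batch of one photo album
# enc_out would feed a standard attention decoder that generates the narrative.
```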
Model-agnostic meta-learners aim to acquire meta-learned parameters from similar tasks to adapt to novel tasks from the same distribution with few gradient updates.
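The adaptation step is a short inner loop of gradient updates on a novel task's support set. Below is a minimal sketch of that loop; the toy linear model, first-order updates, and step size are illustrative assumptions, not a full MAML implementation.

```python
import torch
import torch.nn.functional as F

def adapt(params, support_x, support_y, lr=0.01, steps=5):
    """Adapt meta-learned parameters to a new task with a few gradient updates."""
    adapted = [p.detach().clone().requires_grad_(True) for p in params]
    for _ in range(steps):
        pred = support_x @ adapted[0] + adapted[1]             # toy linear model
        loss = F.mse_loss(pred, support_y)
        grads = torch.autograd.grad(loss, adapted)
        adapted = [(p - lr * g).detach().requires_grad_(True)  # first-order update
                   for p, g in zip(adapted, grads)]
    return adapted

w, b = torch.randn(8, 1), torch.zeros(1)       # stand-ins for meta-learned parameters
x, y = torch.randn(10, 8), torch.randn(10, 1)  # few-shot support set for a novel task
w_task, b_task = adapt([w, b], x, y)
```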
In this paper, we investigate the problem of generalized few-shot learning (GFSL) -- during deployment, a model is required to learn tail categories from only a few shots while simultaneously classifying the head classes.
Providing systems the ability to relate linguistic and visual content is one of the hallmarks of computer vision.
One important limitation of such frameworks is that they seek a common initialization shared across the entire task distribution, substantially limiting the diversity of the task distributions that they are able to learn from.
Many few-shot learning methods address this challenge by learning an instance embedding function from seen classes and applying it to instances from unseen classes with limited labels.
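Once the embedding function is learned, classification of unseen classes can be as simple as nearest-prototype matching. The sketch below averages embedded support examples into class prototypes and classifies queries by cosine similarity; the prototype averaging and cosine metric are illustrative assumptions, not a specific method.

```python
import torch
import torch.nn.functional as F

def classify(embed, support_x, support_y, query_x, num_classes):
    """Embed support and query instances; assign each query to the nearest prototype."""
    s, q = embed(support_x), embed(query_x)
    protos = torch.stack([s[support_y == c].mean(0) for c in range(num_classes)])
    sims = F.cosine_similarity(q.unsqueeze(1), protos.unsqueeze(0), dim=-1)
    return sims.argmax(dim=1)

embed = torch.nn.Linear(512, 64)                     # stands in for the learned embedding
sx, sy = torch.randn(10, 512), torch.arange(10) % 5  # 5 unseen classes, 2 shots each
qx = torch.randn(3, 512)                             # unlabeled queries
print(classify(embed, sx, sy, qx, num_classes=5))
```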
While such tasks are useful for verifying that a machine understands the content of an image, the resulting captions are not engaging to humans.
Analogous to domain adaptation for visual recognition, this setting is appealing when the target dataset does not have a sufficient amount of labeled data to learn an "in-domain" model.
These properties make the approach particularly appealing for transfer learning for open-ended Visual QA, where the source dataset on which the model is learned has limited overlap with the target dataset in the space of answers.
In this paper, we exploit this rich structure for performing graph-based inference in label space for a number of tasks: multi-label image and video classification and action detection in untrimmed videos.
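One simple form of graph-based inference in label space is to propagate per-label scores over a label relation graph so that related labels reinforce each other. The sketch below is such a propagation step; the co-occurrence adjacency matrix and mixing weight `alpha` are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def propagate(scores, adjacency, alpha=0.5, steps=3):
    """Smooth per-label scores over a label graph (a sketch of label-space inference)."""
    A = adjacency / adjacency.sum(axis=1, keepdims=True)  # row-normalize the graph
    s = scores.copy()
    for _ in range(steps):
        s = (1 - alpha) * scores + alpha * A @ s          # mix own score with neighbors'
    return s

# Toy graph over three labels: "helmet" co-occurs strongly with "bicycle".
labels = ["person", "bicycle", "helmet"]
A = np.array([[1.0, 0.8, 0.6],
              [0.8, 1.0, 0.9],
              [0.6, 0.9, 1.0]])
print(propagate(np.array([0.9, 0.7, 0.1]), A))  # "helmet" is pulled up by its neighbors
```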
We propose to train a deep network directly on the compressed video.
We apply the procedures to reconstruct decoy answers for two popular Visual QA datasets as well as to create a new Visual QA dataset from the Visual Genome project, resulting in the largest dataset for this task.
We advocate that high-recall holistic inference of image concepts provides valuable information for detailed pixel labeling.
Images of scenes contain various objects and abundant attributes, and admit visual categorization at diverse levels.
As a concrete example, group activity recognition involves the interactions and relative spatial relations of a set of people in a scene.