This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation.
Ranked #45 on Semantic Segmentation on ADE20K
Performing 3D dense captioning and visual grounding requires a common and shared understanding of the underlying multimodal relationships.
Masked Autoencoding (MAE) has emerged as an effective approach for pre-training representations across multiple domains.
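The recipe behind MAE is simple enough to sketch: hide a large fraction of input patches, encode only the visible ones, and train the network to reconstruct the hidden ones, with the loss computed only at masked positions. The minimal PyTorch sketch below illustrates that loop under assumed names and sizes (TinyMAE, patch_dim, mask_ratio); it is not any particular paper's implementation.

```python
# A minimal masked-autoencoding sketch (hypothetical names and sizes).
# Patches are flattened vectors for simplicity.
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, patch_dim=48, hidden=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.Sequential(nn.Linear(patch_dim, hidden), nn.GELU(),
                                     nn.Linear(hidden, hidden))
        self.decoder = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                     nn.Linear(hidden, patch_dim))
        self.mask_token = nn.Parameter(torch.zeros(hidden))

    def forward(self, patches):                      # patches: (B, N, D)
        B, N, D = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N).argsort(dim=1)        # random permutation per sample
        keep, masked = idx[:, :n_keep], idx[:, n_keep:]
        visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        z = self.encoder(visible)                    # encode visible patches only
        # scatter encoded tokens and mask tokens back into the full sequence
        full = self.mask_token.expand(B, N, -1).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, z.size(-1)), z)
        recon = self.decoder(full)                   # reconstruct every position
        target = torch.gather(patches, 1, masked.unsqueeze(-1).expand(-1, -1, D))
        pred = torch.gather(recon, 1, masked.unsqueeze(-1).expand(-1, -1, D))
        return ((pred - target) ** 2).mean()         # loss on masked patches only

loss = TinyMAE()(torch.randn(2, 16, 48))
loss.backward()
```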
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.
Ranked #4 on Image Retrieval on MS COCO
We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning.
We present Worldsheet, a method for novel view synthesis using just a single RGB image as input.
Image descriptions can help visually impaired people to quickly understand the image content.
Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question.
The actual grounding can connect language to the environment through multiple modalities, e.g., "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route.
For example, conditioning on the "on" relationship to the plate, the object "mug" gathers messages from the object "plate" to update its representation to "mug on the plate", which a simple classifier can then consume for answer prediction (see the sketch below).
Ranked #3 on Referring Expression Comprehension on CLEVR-Ref+
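To make that message-passing step concrete, here is a hedged sketch assuming a simple relation-conditioned graph layer: each edge carries a relation embedding, the destination node ("mug") aggregates messages from its neighbor ("plate") along the "on" edge, and a gated update produces the contextualized representation. The class name RelMessagePassing and the GRU-based update are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of relation-conditioned message passing between object nodes.
import torch
import torch.nn as nn

class RelMessagePassing(nn.Module):
    def __init__(self, obj_dim=64, rel_dim=32):
        super().__init__()
        self.msg = nn.Linear(obj_dim + rel_dim, obj_dim)   # message from neighbor
        self.upd = nn.GRUCell(obj_dim, obj_dim)            # gated node update

    def forward(self, nodes, edges, rel_embs):
        # nodes: (N, obj_dim); edges: list of (src, dst); rel_embs: (E, rel_dim)
        agg = torch.zeros_like(nodes)
        for e, (src, dst) in enumerate(edges):
            # e.g. "mug" (dst) receives a message from "plate" (src) via "on"
            agg[dst] = agg[dst] + self.msg(torch.cat([nodes[src], rel_embs[e]]))
        return self.upd(agg, nodes)                        # updated node states

layer = RelMessagePassing()
nodes = torch.randn(2, 64)                  # [plate, mug]
updated = layer(nodes, edges=[(0, 1)], rel_embs=torch.randn(1, 32))
# updated[1] now encodes "mug on the plate", ready for a simple classifier
```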
Our model improves the textual explanation quality of fine-grained classification decisions on the CUB dataset by mentioning phrases that are grounded in the image.
In complex inferential tasks like question answering, machine learning models must confront two challenges: the need to implement a compositional reasoning process, and, in many applications, the need for this reasoning process to be interpretable to assist users in both development and prediction.
Ranked #14 on Referring Expression Comprehension on Talk2Car
We call such textual explanations counterfactual explanations, and propose an intuitive method to generate them by inspecting which evidence is missing from an input but might contribute to a different classification decision if present in the image.
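As a minimal sketch of that inspection step, assume some upstream classifier already provides (a) evidence scores for attributes in the image and (b) relevance scores of attributes for the counterfactual class; the counterfactual explanation then cites relevant attributes that are absent. The helper name and the toy scores below are hypothetical.

```python
# Hedged sketch: pick class-relevant attributes that are missing from the image,
# i.e. evidence that, if present, could flip the decision. All names are
# hypothetical; real systems score attributes with learned classifiers.

def select_counterfactual_evidence(image_scores, class_relevance, k=2,
                                   presence_threshold=0.5):
    """image_scores: attr -> evidence that the attribute appears in the image.
    class_relevance: attr -> how strongly the attribute supports the
    counterfactual class. Returns the top-k absent-but-relevant attributes."""
    absent = {a: r for a, r in class_relevance.items()
              if image_scores.get(a, 0.0) < presence_threshold}
    return sorted(absent, key=absent.get, reverse=True)[:k]

image_scores = {"red belly": 0.9, "black wings": 0.8, "long beak": 0.1}
relevance_for_other_bird = {"long beak": 0.95, "white throat": 0.7, "red belly": 0.2}
missing = select_counterfactual_evidence(image_scores, relevance_for_other_bird)
print("This is not that bird because it lacks:", ", ".join(missing))
# -> "This is not that bird because it lacks: long beak, white throat"
```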
We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction.
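A hedged sketch of step (2), under the usual speaker-follower formulation: candidate action sequences are rescored by how well each one explains the instruction, mixing the speaker's score P(instruction | trajectory) with the follower's score P(trajectory | instruction). The function names, the mixing weight lam, and the toy stand-in models are assumptions for illustration.

```python
# Hedged sketch of pragmatic reranking of candidate action sequences.
# log_p_speaker / log_p_follower stand in for learned models (assumed here).

def rerank(candidates, instruction, log_p_speaker, log_p_follower, lam=0.5):
    """Pick the trajectory that best explains the instruction:
    score = lam * log P(instr | traj) + (1 - lam) * log P(traj | instr)."""
    def score(traj):
        return (lam * log_p_speaker(instruction, traj)
                + (1 - lam) * log_p_follower(traj, instruction))
    return max(candidates, key=score)

# Toy stand-ins for the learned models:
def log_p_speaker(instr, traj):
    # toy speaker: rewards trajectories that contain the turn the instruction asks for
    return 0.0 if ("right" in instr) == ("right" in traj) else -2.0

def log_p_follower(traj, instr):
    return -0.1 * len(traj)          # toy follower: mild preference for short paths

best = rerank([["fwd", "left"], ["fwd", "fwd", "right"]],
              "go forward then turn right", log_p_speaker, log_p_follower)
print(best)   # -> ['fwd', 'fwd', 'right'] under these toy scores
```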
Existing models which generate textual explanations enforce task relevance through a discriminative term loss function, but such mechanisms only weakly constrain mentioned object parts to actually be present in the image.
Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.
Ranked #42 on Visual Question Answering (VQA) on VQA v2 test-dev
In this paper we instead present a modular deep architecture capable of decomposing referential expressions into their component parts, identifying the entities and relationships mentioned in the input expression and grounding them all in the scene (see the sketch below).
Ranked #1 on Visual Question Answering (VQA) on Visual7W
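A hedged sketch of that modular analysis, assuming the expression has already been parsed into (subject, relation, object) parts: one module scores candidate regions against the subject and object phrases, another scores region pairs against the relation, and the highest-scoring pair grounds the expression. The embeddings and region features below are toy stand-ins, not the paper's modules.

```python
# Hedged sketch of modular grounding for "the mug on the plate".
# Region features and phrase embeddings are random toy stand-ins.
import numpy as np

D = 8
regions = np.random.default_rng(0).normal(size=(5, D))   # 5 candidate regions

def embed(phrase):
    seed = sum(ord(c) for c in phrase)                   # deterministic toy seed
    return np.random.default_rng(seed).normal(size=D)

def localize(phrase):                # module: how well each region matches a phrase
    return regions @ embed(phrase)   # (5,)

def relate(rel):                     # module: score for each (subject, object) pair
    w = embed(rel)
    return np.add.outer(regions @ w, regions @ w)        # (5, 5) toy pair scores

subj, rel, obj = "mug", "on", "plate"                    # assumed parser output
total = localize(subj)[:, None] + relate(rel) + localize(obj)[None, :]
i, j = np.unravel_index(total.argmax(), total.shape)
print(f"subject grounds to region {i}, object to region {j}")
```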
Image segmentation from referring expressions is a joint vision and language modeling task: the input is an image and a textual expression describing a particular region of the image, and the goal is to localize and segment that region based on the given expression.
To produce a pixelwise segmentation for the language expression, we propose an end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information (see the sketch below).
Ranked #16 on Referring Expression Segmentation on J-HMDB
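A minimal sketch of such a recurrent-convolutional model, under common assumptions: an LSTM encodes the expression into a vector, which is tiled across a CNN feature map and fused by 1x1 convolutions into per-pixel mask logits. Layer sizes and the class name are illustrative, not the paper's exact architecture.

```python
# Hedged sketch: LSTM sentence encoding tiled over CNN features, fused by
# convolutions into a pixelwise mask. Sizes and names are illustrative.
import torch
import torch.nn as nn

class ReferringSegSketch(nn.Module):
    def __init__(self, vocab=1000, txt_dim=64, vis_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, txt_dim)
        self.lstm = nn.LSTM(txt_dim, txt_dim, batch_first=True)
        self.cnn = nn.Sequential(nn.Conv2d(3, vis_dim, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(vis_dim, vis_dim, 3, padding=1))
        self.fuse = nn.Sequential(nn.Conv2d(vis_dim + txt_dim, 32, 1), nn.ReLU(),
                                  nn.Conv2d(32, 1, 1))     # 1-channel mask logits

    def forward(self, image, tokens):
        feat = self.cnn(image)                             # (B, C, H, W)
        _, (h, _) = self.lstm(self.embed(tokens))          # h: (1, B, txt_dim)
        txt = h[-1][:, :, None, None].expand(-1, -1, *feat.shape[2:])
        return self.fuse(torch.cat([feat, txt], dim=1))    # (B, 1, H, W) logits

model = ReferringSegSketch()
mask_logits = model(torch.randn(2, 3, 24, 24), torch.randint(0, 1000, (2, 7)))
print(mask_logits.shape)   # torch.Size([2, 1, 24, 24])
```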
In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object.
Ranked #12 on Referring Expression Comprehension on Talk2Car
We propose a novel approach that learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly (see the sketch below).
Ranked #12 on Phrase Grounding on Flickr30k Entities Test
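A hedged sketch of grounding-by-reconstruction: latent attention over region proposals selects a visual feature for the phrase, and the model is trained only to reconstruct the phrase from that attended feature, so the reconstruction loss indirectly supervises the attention. The bilinear scorer and the simplified per-token decoder below are assumptions, not the paper's exact architecture.

```python
# Hedged sketch of phrase grounding learned via phrase reconstruction.
# The attention over regions is latent: only the reconstruction is supervised.
import torch
import torch.nn as nn

class ReconstructionGrounder(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.phrase_enc = nn.LSTM(dim, dim, batch_first=True)
        self.att = nn.Bilinear(dim, dim, 1)                # phrase-region score
        self.decode = nn.Linear(dim, vocab)                # reconstruct tokens

    def forward(self, phrase, regions):
        # phrase: (B, T) token ids; regions: (B, R, dim) proposal features
        _, (h, _) = self.phrase_enc(self.embed(phrase))    # h[-1]: (B, dim)
        q = h[-1].unsqueeze(1).expand(-1, regions.size(1), -1)
        alpha = self.att(q, regions).softmax(dim=1)        # (B, R, 1) attention
        attended = (alpha * regions).sum(dim=1)            # (B, dim)
        logits = self.decode(attended)                     # (B, vocab)
        # toy reconstruction loss: predict each phrase token from the attended
        # visual feature (real models decode the phrase with an LSTM)
        loss = nn.functional.cross_entropy(
            logits.unsqueeze(1).expand(-1, phrase.size(1), -1)
                  .reshape(-1, logits.size(-1)),
            phrase.reshape(-1))
        return loss, alpha.squeeze(-1)                     # alpha = grounding

model = ReconstructionGrounder()
loss, attn = model(torch.randint(0, 1000, (2, 5)), torch.randn(2, 4, 64))
loss.backward()   # gradients train the latent attention end to end
```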
Our approach proves especially useful in large-scale settings with thousands of classes, where spatial and semantic interactions are very frequent and only weakly supervised detectors can be built due to a lack of bounding-box annotations.
A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories.