In this work, we deviate from recent, popular task settings and consider the problem in an autonomous-driving scenario.
Approaches to multimodal pooling include the element-wise product or sum of the visual and textual representations, as well as their concatenation.
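These fusion operators are simple enough to show concretely. Below is a minimal PyTorch sketch of the three pooling strategies; the function name and feature dimensions are illustrative, not taken from any particular paper.

```python
import torch

def pool_features(visual: torch.Tensor, textual: torch.Tensor,
                  mode: str = "concat") -> torch.Tensor:
    """Fuse visual and textual feature vectors of equal dimension."""
    if mode == "product":            # element-wise (Hadamard) product
        return visual * textual
    if mode == "sum":                # element-wise sum
        return visual + textual
    if mode == "concat":             # concatenation doubles the feature size
        return torch.cat([visual, textual], dim=-1)
    raise ValueError(f"unknown pooling mode: {mode}")

v = torch.randn(8, 512)              # a batch of visual features
t = torch.randn(8, 512)              # a batch of textual features
fused = pool_features(v, t, mode="product")   # shape: (8, 512)
```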
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Ranked #1 on Visual Question Answering on CLEVR-Humans
We propose a novel approach that learns grounding by reconstructing a given phrase through an attention mechanism, which can be either latent or optimized directly (sketched after the leaderboard note below).
Ranked #4 on Phrase Grounding on Flickr30k Entities Test
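As a rough illustration of grounding by reconstruction, the sketch below attends over pre-extracted region features conditioned on a phrase embedding, then tries to reconstruct the phrase from the attended region. The decoder is reduced to a single bag-of-words linear layer for brevity; all module names and dimensions are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AttendAndReconstruct(nn.Module):
    """Ground a phrase by attending over regions, then reconstructing it."""
    def __init__(self, region_dim: int, phrase_dim: int,
                 hidden: int, vocab_size: int):
        super().__init__()
        self.score = nn.Sequential(      # scores each region given the phrase
            nn.Linear(region_dim + phrase_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        self.decode = nn.Linear(region_dim, vocab_size)  # bag-of-words decoder

    def forward(self, regions: torch.Tensor, phrase: torch.Tensor):
        # regions: (batch, n_regions, region_dim); phrase: (batch, phrase_dim)
        tiled = phrase.unsqueeze(1).expand(-1, regions.size(1), -1)
        attn = torch.softmax(
            self.score(torch.cat([regions, tiled], dim=-1)).squeeze(-1), dim=-1)
        attended = (attn.unsqueeze(-1) * regions).sum(dim=1)  # latent grounding
        # Reconstruction logits; a loss against the phrase words trains the
        # attention without any box-level supervision.
        return attn, self.decode(attended)
```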
We propose a simple, fast, and accurate one-stage approach to visual grounding, inspired by the following insight.
Several works have proposed to learn a two-path neural network that maps images and texts, respectively, into a shared Euclidean space whose geometry captures useful semantic relationships.
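A minimal sketch of such a two-path embedding, assuming precomputed image and text features; the projection layers, dimensions, and the triplet-loss usage at the end are illustrative choices rather than a specific paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPathEmbedding(nn.Module):
    """Two branches projecting each modality into one shared space."""
    def __init__(self, img_dim: int, txt_dim: int, embed_dim: int = 256):
        super().__init__()
        self.img_path = nn.Linear(img_dim, embed_dim)   # image branch
        self.txt_path = nn.Linear(txt_dim, embed_dim)   # text branch

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor):
        # L2-normalize so distances in the shared space are comparable
        return (F.normalize(self.img_path(img_feat), dim=-1),
                F.normalize(self.txt_path(txt_feat), dim=-1))

model = TwoPathEmbedding(img_dim=2048, txt_dim=300)
img_e, txt_e = model(torch.randn(16, 2048), torch.randn(16, 300))
# Pull matching pairs together, push shuffled (negative) pairs apart.
negatives = img_e[torch.randperm(img_e.size(0))]
loss = F.triplet_margin_loss(txt_e, img_e, negatives)
```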
Visual dialog entails answering a series of questions grounded in an image, using dialog history as context.
Ranked #1 on Common Sense Reasoning on Visual Dialog v0.9
We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing.
Ranked #1 on Image Captioning on Localized Narratives