Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, which are trained with a relatively small set of manually annotated data (compared to web-crawled data), to perceive the visual world.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
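A minimal sketch of this teacher-to-student transfer, assuming PyTorch-style models that map token IDs to logits; the function names and training-loop details below are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-target KL: the student matches the (frozen) teacher's output distribution.
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * (t * t)

def distill_step(student, teacher, input_ids, optimizer):
    # The teacher was trained on video-text pairs; here it only sees the text batch.
    with torch.no_grad():
        teacher_logits = teacher(input_ids)
    student_logits = student(input_ids)
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```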
Unlike language, where text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), so uniformly masking individual tokens makes the task too trivial to learn useful representations.
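To make this concrete, here is a hedged sketch of masking contiguous temporal blocks instead of individual tokens, so a masked token cannot simply be copied from a near-identical neighboring frame; the shapes and block size are illustrative assumptions:

```python
import numpy as np

def block_mask(num_frames, tokens_per_frame, mask_ratio=0.5, block_frames=4, rng=None):
    # Mask whole temporal blocks rather than independent tokens.
    rng = rng if rng is not None else np.random.default_rng()
    mask = np.zeros((num_frames, tokens_per_frame), dtype=bool)
    target = int(mask_ratio * num_frames * tokens_per_frame)
    while mask.sum() < target:
        start = int(rng.integers(0, max(1, num_frames - block_frames + 1)))
        mask[start:start + block_frames, :] = True
    return mask

mask = block_mask(num_frames=16, tokens_per_frame=196)  # e.g., 16 frames of 14x14 patches
```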
One key challenge in this task is to ground instructions with the current visual information that the agent perceives.
On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models.
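A minimal sketch of what such a unified generative formulation can look like, where every task is cast as text-to-text for a single model; the prompt templates and field names below are assumptions for illustration, not the paper's exact format:

```python
def to_text2text(task, example):
    # Each task becomes an (input text, target text) pair for one seq2seq model.
    if task == "vqa":
        return f"vqa: {example['question']}", example["answer"]
    if task == "refexp":
        return f"ground: {example['expression']}", example["region_token"]
    if task == "vcr":
        return f"vcr: {example['question']} choices: {' | '.join(example['choices'])}", example["answer"]
    raise ValueError(f"unknown task: {task}")

print(to_text2text("vqa", {"question": "What color is the bus?", "answer": "red"}))
```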
During this task, the agent (similar to a PokeMON GO player) is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment, but then also ARRAnge the collected objects part-by-part in an egocentric grid-layout environment.
We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora.
Despite the remarkable successes of Convolutional Neural Networks (CNNs) in computer vision, it is time-consuming and error-prone to manually design a CNN.
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.
For the first question, we conduct a thorough empirical study on analysis sets and find that, in addition to the unstable final performance, the instability persists throughout the training curve.
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
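The following is a structural sketch of that three-encoder layout in PyTorch; the layer counts, dimensions, and the single cross-attention direction shown are simplifications for illustration rather than the released LXMERT implementation:

```python
import torch
import torch.nn as nn

class ThreeEncoderSketch(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        def make_layer():
            return nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.object_encoder = nn.TransformerEncoder(make_layer(), num_layers=5)    # object-relationship encoder
        self.language_encoder = nn.TransformerEncoder(make_layer(), num_layers=9)  # language encoder
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)      # cross-modality encoder (simplified)

    def forward(self, object_feats, text_embeds):
        v = self.object_encoder(object_feats)   # (B, num_objects, dim)
        l = self.language_encoder(text_embeds)  # (B, num_tokens, dim)
        fused, _ = self.cross_attn(query=l, key=v, value=v)  # language attends to vision
        return fused
```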
To push forward the research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions.
Our results show the feasibility of a robot learning commonsense knowledge automatically from web-based textual corpora, and the power of learned commonsense reasoning models in enabling a robot to autonomously perform tasks based on incomplete natural language instructions.
Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions.
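A hedged sketch of that back-translation step, in which a speaker model labels unannotated paths sampled in the new (feature-dropped) environments; the callables passed in are hypothetical stand-ins, not the paper's actual API:

```python
def back_translate(new_envs, sample_path, speak, paths_per_env=10):
    # new_envs:    environments with dropped visual features ("new" environments)
    # sample_path: draws an unlabeled trajectory in an environment
    # speak:       speaker model that back-translates a path into an instruction
    augmented = []
    for env in new_envs:
        for _ in range(paths_per_env):
            path = sample_path(env)
            instruction = speak(env, path)
            augmented.append((env, path, instruction))  # extra (path, instruction) training pairs
    return augmented
```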
Visual reasoning with compositional natural language instructions, e.g., based on the newly-released Cornell Natural Language Visual Reasoning (NLVR) dataset, is a challenging task, where the model must be able to create an accurate mapping between the diverse phrases and the several objects placed in complex arrangements in the image.
Models that can execute natural language instructions for situated robotic tasks such as assembly and navigation have several useful applications in homes, offices, and remote scenarios.
The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.
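A minimal sketch of how the three modules' objectives can be combined, assuming PyTorch tensors; the weighting and interfaces are assumptions for illustration, not the paper's exact formulation:

```python
import torch

def joint_loss(speaker_nll, listener_loss, sample_log_probs, rewards, w_rl=1.0):
    # speaker_nll:      likelihood loss for generating referring expressions
    # listener_loss:    comprehension loss for picking the referred object
    # sample_log_probs: log-probabilities of expressions sampled from the speaker
    # rewards:          reinforcer scores for how discriminative each sample is
    # The REINFORCE-style term pushes the speaker toward expressions the reinforcer rewards.
    rl_term = -(rewards.detach() * sample_log_probs).mean()
    return speaker_nll + listener_loss + w_rl * rl_term
```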