In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
Ranked #1 on Visual Question Answering on VizWiz 2018
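The cross-modality encoder mentioned above lets language tokens attend over detected object regions. As a minimal sketch (not LXMERT's actual implementation — layer sizes, helper names, and the single-head form here are illustrative assumptions), cross-modal attention can be written as scaled dot-product attention where queries come from one modality and keys/values from the other:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(lang, objs, d):
    # lang: (L, d) language token features; objs: (N, d) object region features.
    # Each language token queries the object features, producing an
    # object-informed representation of the same shape as `lang`.
    scores = lang @ objs.T / np.sqrt(d)   # (L, N) alignment scores
    weights = softmax(scores, axis=-1)    # attention distribution over objects
    return weights @ objs                 # (L, d) fused features

rng = np.random.default_rng(0)
lang = rng.standard_normal((5, 16))   # 5 word tokens, hypothetical dim 16
objs = rng.standard_normal((8, 16))   # 8 detected object regions
out = cross_attention(lang, objs, 16)
print(out.shape)  # (5, 16)
```

In the full model this runs in both directions (language-to-vision and vision-to-language), with learned projections and multiple heads.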
Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes.
Ranked #4 on Visual Question Answering on CLEVR-Humans
We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A.
Ranked #4 on Scene Graph Generation on Visual Genome
We introduce the Neural State Machine, seeking to bridge the gap between the neural and symbolic views of AI and integrate their complementary strengths for the task of visual reasoning.
Ranked #1 on Visual Question Answering on GQA test-dev
We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets.
Ranked #4 on Visual Question Answering on GQA test-std
We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning.
Ranked #1 on Visual Question Answering on CLEVR-Humans
The PHYRE benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles.
Ranked #3 on Visual Reasoning on PHYRE-1B-Within
Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks.
Ranked #3 on Visual Question Answering on CLEVR
We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks.
Ranked #1 on Phrase Grounding on Flickr30k Entities Dev
We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language.