A disentangled representation encodes information about the salient factors of variation in the data independently.
We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning.
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
SOTA for Visual Reasoning on NLVR
We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A.
SOTA for Visual Question Answering on VQA v2 (Percentage correct metric)
We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks.
#2 best model for Visual Reasoning on NLVR
We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation.
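As a rough illustration of feature-wise linear modulation, the sketch below scales and shifts each feature channel with a per-channel pair (gamma, beta) computed from a conditioning vector. The helper names, the toy weights, and the use of a single affine map to produce gamma and beta are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Sequence


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           z: Sequence[float]) -> List[float]:
    """Affine map W z + b, returned as a plain list (hypothetical helper)."""
    return [sum(w * v for w, v in zip(row, z)) + b
            for row, b in zip(weights, bias)]


def film(features: Sequence[Sequence[float]], gamma: Sequence[float],
         beta: Sequence[float]) -> List[List[float]]:
    """Feature-wise linear modulation: gamma_c * x + beta_c for each value in channel c."""
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Toy setup (assumed values): 2 feature channels and a 3-dim conditioning vector.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
z = [1.0, 0.0, -1.0]

# Assumed conditioning weights; in practice these would be learned.
W_gamma = [[0.5, 0.0, 0.0], [0.0, 0.0, 1.0]]
b_gamma = [1.0, 1.0]
W_beta = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
b_beta = [0.0, 0.0]

gamma = linear(W_gamma, b_gamma, z)   # one scale per channel
beta = linear(W_beta, b_beta, z)      # one shift per channel
modulated = film(features, gamma, beta)
print(modulated)
```

Here the second channel receives a scale of zero, so the conditioning input effectively replaces it with the shift value alone; this is one way such modulation can switch features on or off.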
Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes.
When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings.
Human activity recognition is typically addressed by detecting key concepts like global and local motion, features related to object classes present in the scene, as well as features related to the global context.