Visual Commonsense Reasoning
20 papers with code • 1 benchmark • 4 datasets
Most implemented papers
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
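ViLBERT's two streams are tied together by co-attentional layers in which each modality queries the other for keys and values. A minimal sketch of that co-attention idea is below; the dimensions and the use of PyTorch's nn.MultiheadAttention are illustrative assumptions, not the paper's exact layers.

```python
import torch
import torch.nn as nn

dim, heads = 256, 4
# One cross-attention module per direction (text -> image, image -> text).
txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

txt = torch.randn(2, 16, dim)   # word features   (batch, tokens, dim)
img = torch.randn(2, 36, dim)   # region features (batch, regions, dim)

# Co-attention: queries come from one stream, keys/values from the other.
txt_out, _ = txt_attends_img(query=txt, key=img, value=img)
img_out, _ = img_attends_txt(query=img, key=txt, value=txt)
```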
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
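A rough illustration of conditional masking is sketched below: exactly one modality is corrupted per pre-training example while the other is observed in full. The tensor shapes, 15% masking rate, and mask token id are placeholders, not UNITER's actual configuration.

```python
import torch

MASK_PROB = 0.15

def conditional_mask(text_ids, region_feats, mask_text=True, mask_token_id=103):
    """Mask exactly one modality, conditioning on the full other modality."""
    text_ids = text_ids.clone()
    region_feats = region_feats.clone()
    if mask_text:
        # Masked language modeling: corrupt text tokens, leave all regions intact.
        mask = torch.rand(text_ids.shape) < MASK_PROB
        text_ids[mask] = mask_token_id
    else:
        # Masked region modeling: zero out region features, leave all text intact.
        mask = torch.rand(region_feats.shape[:-1]) < MASK_PROB
        region_feats[mask] = 0.0
    return text_ids, region_feats

# Usage: one step masks only text, another masks only regions, never both at once.
text_ids = torch.randint(0, 30000, (2, 16))   # (batch, tokens)
region_feats = torch.randn(2, 36, 2048)       # (batch, regions, feature dim)
masked_text, full_regions = conditional_mask(text_ids, region_feats, mask_text=True)
full_text, masked_regions = conditional_mask(text_ids, region_feats, mask_text=False)
```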
From Recognition to Cognition: Visual Commonsense Reasoning
While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
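A hedged sketch of adversarial training in embedding space, in the spirit of VILLA: perturb the word and region embeddings rather than the raw inputs, then add a loss on the perturbed embeddings. The single gradient step, step size, and the hypothetical model/loss_fn callables are assumptions, not the paper's exact recipe.

```python
import torch

def adversarial_loss(model, txt_emb, img_emb, labels, loss_fn, eps=1e-2):
    """Clean loss plus loss on gradient-perturbed text/image embeddings."""
    txt_emb = txt_emb.detach().requires_grad_(True)
    img_emb = img_emb.detach().requires_grad_(True)
    clean_loss = loss_fn(model(txt_emb, img_emb), labels)
    # Gradients w.r.t. the embeddings give the perturbation direction.
    g_txt, g_img = torch.autograd.grad(clean_loss, [txt_emb, img_emb],
                                       retain_graph=True)
    # One small normalized step (a crude stand-in for multi-step adversarial training).
    txt_adv = txt_emb + eps * g_txt / (g_txt.norm() + 1e-8)
    img_adv = img_emb + eps * g_img / (g_img.norm() + 1e-8)
    adv_loss = loss_fn(model(txt_adv, img_adv), labels)
    return clean_loss + adv_loss
```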
Think Visually: Question Answering through Virtual Imagery
In this paper, we study the problem of geometric reasoning in the context of question-answering.
Fusion of Detected Objects in Text for Visual Question Answering
To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language.
Heterogeneous Graph Learning for Visual Commonsense Reasoning
Our HGL consists of a primal vision-to-answer heterogeneous graph (VAHG) module and a dual question-to-answer heterogeneous graph (QAHG) module to interactively refine reasoning paths for semantic agreement.
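An illustrative single step of heterogeneous-graph message passing in the spirit of the VAHG/QAHG modules, where answer nodes gather evidence from vision or question nodes via attention over bipartite edges; the layer sizes and single-head attention are assumptions, not the paper's modules.

```python
import torch
import torch.nn as nn

class HeteroGraphStep(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # project destination (answer) nodes
        self.k_proj = nn.Linear(dim, dim)   # project source nodes (vision or question)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, answer_nodes, source_nodes):
        # answer_nodes: (batch, n_ans, dim); source_nodes: (batch, n_src, dim)
        attn = self.q_proj(answer_nodes) @ self.k_proj(source_nodes).transpose(1, 2)
        attn = attn.softmax(dim=-1)                              # bipartite edge weights
        return answer_nodes + attn @ self.v_proj(source_nodes)   # refined answer nodes

dim = 256
vahg = HeteroGraphStep(dim)   # vision-to-answer graph module
qahg = HeteroGraphStep(dim)   # question-to-answer graph module
ans = torch.randn(2, 4, dim)
vis = torch.randn(2, 36, dim)
que = torch.randn(2, 12, dim)
ans = qahg(vahg(ans, vis), que)   # interleave both modules to refine the answer nodes
```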
TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines
Despite impressive recent progress on tasks that require reasoning, such as visual question answering and visual dialog, models often exploit biases in datasets.
Connective Cognition Network for Directional Visual Commonsense Reasoning
Towards VCR, we propose a connective cognition network (CCN) to dynamically reorganize the visual neuron connectivity, contextualized by the meaning of questions and answers.
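A minimal sketch of question-conditioned connectivity follows, assuming visual node features and a question vector: the adjacency among visual nodes is recomputed after gating the nodes by the question. This is a simplification for illustration, not CCN's actual architecture.

```python
import torch
import torch.nn as nn

class QuestionConditionedGraph(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.edge_score = nn.Bilinear(dim, dim, 1)   # scores a pair of visual nodes
        self.q_gate = nn.Linear(dim, dim)            # injects question meaning into nodes

    def forward(self, visual_nodes, question_vec):
        # visual_nodes: (n, dim); question_vec: (dim,)
        gated = visual_nodes * torch.sigmoid(self.q_gate(question_vec))  # question-aware nodes
        n, dim = gated.shape
        left = gated.unsqueeze(1).expand(n, n, dim).reshape(-1, dim)
        right = gated.unsqueeze(0).expand(n, n, dim).reshape(-1, dim)
        # Re-compute the connectivity (adjacency) among visual nodes for this question.
        adj = self.edge_score(left, right).view(n, n).softmax(dim=-1)
        return adj @ visual_nodes                     # message passing over the new connectivity

graph = QuestionConditionedGraph(128)
out = graph(torch.randn(36, 128), torch.randn(128))
```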