Visual Commonsense Reasoning
27 papers with code • 7 benchmarks • 7 datasets
These leaderboards are used to track progress in Visual Commonsense Reasoning.
Most implemented papers
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
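ViLBERT learns its joint representations with a two-stream design in which each modality attends over the other. The snippet below is a minimal, illustrative co-attention block under simplified assumptions (dimensions, layer layout, and the `CoAttentionBlock` name are not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Illustrative co-attention in the spirit of ViLBERT's two-stream design:
    each modality's queries attend over the other modality's keys/values."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, txt, img):
        # Text stream: queries from text tokens, keys/values from image regions.
        txt_out, _ = self.txt_attends_img(txt, img, img)
        # Image stream: queries from image regions, keys/values from text tokens.
        img_out, _ = self.img_attends_txt(img, txt, txt)
        return txt + txt_out, img + img_out

# Toy usage: 12 text tokens and 36 region features, both projected to 768-d.
txt = torch.randn(2, 12, 768)
img = torch.randn(2, 36, 768)
txt_out, img_out = CoAttentionBlock()(txt, img)
```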
UNITER: UNiversal Image-TExt Representation Learning
Unlike previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text).
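To make the distinction concrete, here is a minimal sketch of conditional masking versus joint random masking. The sampling scheme and function names are simplifying assumptions, not UNITER's exact procedure:

```python
import torch

def conditional_masks(num_txt, num_img, p=0.15, mask_text=True):
    """Conditional masking: only one modality is masked in a given step, so the
    masked language (or region) objective conditions on the full other modality."""
    txt_mask = torch.zeros(num_txt, dtype=torch.bool)
    img_mask = torch.zeros(num_img, dtype=torch.bool)
    if mask_text:
        txt_mask = torch.rand(num_txt) < p   # mask some text tokens ...
        # ... while every image region stays observed.
    else:
        img_mask = torch.rand(num_img) < p   # mask some image regions ...
        # ... while every text token stays observed.
    return txt_mask, img_mask

def joint_random_masks(num_txt, num_img, p=0.15):
    """Joint random masking, by contrast, masks both modalities at once."""
    return torch.rand(num_txt) < p, torch.rand(num_img) < p
```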
From Recognition to Cognition: Visual Commonsense Reasoning
While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
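VILLA applies its adversarial perturbations in the embedding space of both modalities rather than to raw tokens or pixels. The sketch below illustrates that idea under strong simplifications: a single-step perturbation and a hypothetical `model` that consumes embeddings directly (VILLA itself builds on multi-step "free" adversarial training):

```python
import torch

def villa_style_loss(model, txt_emb, img_emb, labels, loss_fn, eps=1e-3):
    """Illustrative adversarial training step in embedding space (not VILLA's
    exact algorithm): perturb text/image embeddings along the loss gradient and
    train on the clean plus adversarial losses."""
    txt_emb = txt_emb.detach().requires_grad_(True)
    img_emb = img_emb.detach().requires_grad_(True)

    # Clean loss and gradients w.r.t. the input embeddings.
    clean_loss = loss_fn(model(txt_emb, img_emb), labels)
    g_txt, g_img = torch.autograd.grad(clean_loss, [txt_emb, img_emb], retain_graph=True)

    # Small L2-normalised perturbations along the gradient direction.
    txt_adv = txt_emb + eps * g_txt / (g_txt.norm() + 1e-12)
    img_adv = img_emb + eps * g_img / (g_img.norm() + 1e-12)

    # Loss on the perturbed embeddings; the caller backpropagates the sum.
    adv_loss = loss_fn(model(txt_adv, img_adv), labels)
    return clean_loss + adv_loss
```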
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics
Nevertheless, there has been no open-source codebase that supports training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.
Think Visually: Question Answering through Virtual Imagery
In this paper, we study the problem of geometric reasoning in the context of question-answering.
Fusion of Detected Objects in Text for Visual Question Answering
To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language.
Heterogeneous Graph Learning for Visual Commonsense Reasoning
Our HGL consists of a primal vision-to-answer heterogeneous graph (VAHG) module and a dual question-to-answer heterogeneous graph (QAHG) module to interactively refine reasoning paths for semantic agreement.
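As a rough illustration of one half of this design, the sketch below shows a vision-to-answer graph module in which answer-word nodes aggregate messages from image-region nodes over soft, attention-weighted edges. The projections, single-hop update, and class name are simplified assumptions, not HGL's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionToAnswerGraph(nn.Module):
    """Illustrative vision-to-answer heterogeneous graph module: answer nodes
    are refined by attention-weighted messages from vision (region) nodes."""

    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # project answer nodes (queries)
        self.k = nn.Linear(dim, dim)   # project vision nodes (keys)
        self.v = nn.Linear(dim, dim)   # project vision nodes (messages)

    def forward(self, answer_nodes, vision_nodes):
        # Soft edge weights between every answer node and every vision node.
        scores = self.q(answer_nodes) @ self.k(vision_nodes).transpose(-1, -2)
        edges = F.softmax(scores / answer_nodes.size(-1) ** 0.5, dim=-1)
        # Aggregate vision messages into the answer nodes and fuse residually.
        messages = edges @ self.v(vision_nodes)
        return answer_nodes + messages

# Toy usage: 8 answer-word nodes attend over 36 region nodes.
ans = torch.randn(2, 8, 512)
vis = torch.randn(2, 36, 512)
refined = VisionToAnswerGraph()(ans, vis)
```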
TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines
Despite impressive recent progress reported on tasks that necessitate reasoning, such as visual question answering and visual dialog, models often exploit biases in datasets.