Visual Commonsense Reasoning

27 papers with code • 7 benchmarks • 7 datasets

Most implemented papers

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Unlike previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text).
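The idea of conditional masking can be sketched as follows: mask tokens in one modality while leaving the other fully observed, rather than masking both at once. This is only an illustrative sketch, not UNITER's actual code; the function name, 15% mask rate, and `"[MASK]"`/`None` placeholders are assumptions.

```python
import random

def conditional_mask(text_tokens, region_feats, mask_prob=0.15, mask_text=True):
    """Conditional masking sketch: mask ONE modality while the other
    stays fully observed (hypothetical illustration of the idea;
    names and the mask rate are assumptions, not the paper's code)."""
    if mask_text:
        # Mask some text tokens; image regions remain fully observed.
        masked = [t if random.random() > mask_prob else "[MASK]"
                  for t in text_tokens]
        return masked, region_feats
    else:
        # Mask some region features; text remains fully observed.
        masked = [r if random.random() > mask_prob else None
                  for r in region_feats]
        return text_tokens, masked
```

Joint random masking, by contrast, would apply the masking step to both lists in the same pass, so the model could be asked to predict a masked word whose corresponding image region is also masked.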

From Recognition to Cognition: Visual Commonsense Reasoning

rowanz/r2c CVPR 2019

While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world.

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

jackroos/VL-BERT ICLR 2020

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

zhegan27/VILLA NeurIPS 2020

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
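Adversarial training in this setting perturbs the embedding space rather than raw pixels or words. A minimal sketch of one gradient-direction perturbation step, assuming an L2-normalized update and an illustrative `epsilon`; this is the generic idea, not VILLA's exact algorithm (which adds further machinery such as adversarial regularization terms):

```python
import numpy as np

def adversarial_perturb(embedding, grad, epsilon=1e-3):
    """One embedding-space adversarial step: move the embedding a
    small, fixed-norm amount along the loss gradient direction
    (generic sketch; epsilon and L2 normalization are assumptions)."""
    norm = np.linalg.norm(grad)
    if norm == 0:
        # Zero gradient: no adversarial direction, return unchanged.
        return embedding
    return embedding + epsilon * grad / norm
```

Training then minimizes the task loss on both clean and perturbed embeddings, encouraging representations that are stable under small input shifts.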

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

yehli/xmodaler 18 Aug 2021

Nevertheless, no open-source codebase has yet supported training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.

Think Visually: Question Answering through Virtual Imagery

umich-vl/think_visually ACL 2018

In this paper, we study the problem of geometric reasoning in the context of question-answering.

Fusion of Detected Objects in Text for Visual Question Answering

google-research/language IJCNLP 2019

To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language.

Heterogeneous Graph Learning for Visual Commonsense Reasoning

yuweijiang/HGL-pytorch NeurIPS 2019

Our HGL consists of a primal vision-to-answer heterogeneous graph (VAHG) module and a dual question-to-answer heterogeneous graph (QAHG) module to interactively refine reasoning paths for semantic agreement.

TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines

Deanplayerljx/tab-vcr NeurIPS 2019

Despite impressive recent progress that has been reported on tasks that necessitate reasoning, such as visual question answering and visual dialog, models often exploit biases in datasets.