While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world.
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
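Neither sentence pins down an architecture, and the two models in fact differ (ViLBERT routes each modality through its own stream linked by co-attentional transformer layers, while VL-BERT uses a single-stream encoder). Purely as an illustrative sketch of what a joint image-text representation can look like, the snippet below concatenates detector region features with word-piece embeddings and encodes them with one Transformer; all dimensions, layer counts, and the feature projections are assumptions, not either paper's published configuration.

```python
import torch
import torch.nn as nn

class JointVisionLanguageEncoder(nn.Module):
    """Minimal single-stream sketch: project detector region features and text
    token embeddings into one space and encode them jointly with a Transformer.
    Hyper-parameters are illustrative, not the published configurations."""

    def __init__(self, vocab_size=30522, region_dim=2048, d_model=768,
                 num_layers=6, num_heads=12):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.region_proj = nn.Linear(region_dim, d_model)    # map RoI features to d_model
        self.type_emb = nn.Embedding(2, d_model)             # 0 = text token, 1 = image region
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids, region_feats):
        # token_ids: (B, T) word-piece ids; region_feats: (B, R, region_dim) detector outputs
        text = self.token_emb(token_ids) + self.type_emb(torch.zeros_like(token_ids))
        region_types = torch.ones(region_feats.shape[:2], dtype=torch.long,
                                  device=region_feats.device)
        regions = self.region_proj(region_feats) + self.type_emb(region_types)
        joint = torch.cat([text, regions], dim=1)             # one sequence over both modalities
        return self.encoder(joint)                            # (B, T + R, d_model)

# Example: a 16-token question-answer pair grounded in 36 detected regions.
enc = JointVisionLanguageEncoder()
out = enc(torch.randint(0, 30522, (1, 16)), torch.randn(1, 36, 2048))
print(out.shape)  # torch.Size([1, 52, 768])
```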
Our heterogeneous graph learning (HGL) framework consists of a primal vision-to-answer heterogeneous graph (VAHG) module and a dual question-to-answer heterogeneous graph (QAHG) module that interactively refine reasoning paths for semantic agreement.
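The graph construction and message-passing rules behind the VAHG and QAHG modules are not given in this summary; the following is only a loose stand-in for the primal/dual pattern, in which answer-word nodes gather messages from vision nodes and from question-word nodes before the two views are fused. The module names, dimensions, and the cross-attention realization are assumptions.

```python
import torch
import torch.nn as nn

class DualCrossGraphSketch(nn.Module):
    """Illustrative stand-in for the primal/dual pattern: answer-word nodes
    gather messages from vision nodes (VAHG-like direction) and from
    question-word nodes (QAHG-like direction), and the two refined views are
    fused. Not the paper's graph construction or update rules."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.vision_to_answer = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.question_to_answer = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, answer_nodes, vision_nodes, question_nodes):
        # answer_nodes: (B, A, d); vision_nodes: (B, V, d); question_nodes: (B, Q, d)
        from_vision, _ = self.vision_to_answer(answer_nodes, vision_nodes, vision_nodes)
        from_question, _ = self.question_to_answer(answer_nodes, question_nodes, question_nodes)
        return self.fuse(torch.cat([from_vision, from_question], dim=-1))  # refined answer nodes

module = DualCrossGraphSketch()
refined = module(torch.randn(2, 12, 512), torch.randn(2, 36, 512), torch.randn(2, 20, 512))
print(refined.shape)  # torch.Size([2, 12, 512])
```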
Inspired by this idea, we propose a connective cognition network (CCN) for VCR that dynamically reorganizes visual neuron connectivity conditioned on the meaning of questions and answers.
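This one-sentence summary does not say how the connectivity is actually reorganized; the sketch below is only a loose analogue in which a pooled question-plus-answer embedding gates an affinity matrix over visual region features before a single propagation step. Every name and operation in it is illustrative rather than the CCN formulation.

```python
import torch
import torch.nn as nn

class SentenceConditionedVisualGraph(nn.Module):
    """Loose analogue of text-conditioned visual connectivity: a pooled sentence
    vector (question plus candidate answer) gates the affinity matrix over
    visual region features before one propagation step. Gating and propagation
    are illustrative only, not the CCN formulation."""

    def __init__(self, d_model=512):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)
        self.update = nn.Linear(d_model, d_model)

    def forward(self, region_feats, sentence_vec):
        # region_feats: (B, R, d); sentence_vec: (B, d)
        conditioned = region_feats * torch.sigmoid(self.gate(sentence_vec)).unsqueeze(1)
        affinity = torch.softmax(conditioned @ conditioned.transpose(1, 2), dim=-1)  # (B, R, R)
        return region_feats + self.update(affinity @ region_feats)  # text-dependent message passing

net = SentenceConditionedVisualGraph()
out = net(torch.randn(2, 36, 512), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 36, 512])
```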