Visual Reasoning

211 papers with code • 12 benchmarks • 41 datasets

The ability to understand and reason about actions and relationships depicted in visual images

Libraries

Use these libraries to find Visual Reasoning models and implementations

Most implemented papers

FiLM: Visual Reasoning with a General Conditioning Layer

ethanjperez/film 22 Sep 2017

We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation.
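For illustration, a minimal PyTorch sketch of a FiLM layer; shapes and names are assumptions for this sketch, not the reference code from ethanjperez/film:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift each feature map with
    parameters predicted from a conditioning vector (e.g. a question embedding)."""
    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # One linear layer predicts per-channel gamma and beta.
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, height, width); cond: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (batch, channels, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * features + beta

# Example: modulate a 64-channel feature map with a 128-d question embedding.
film = FiLM(cond_dim=128, num_channels=64)
x = torch.randn(2, 64, 14, 14)
q = torch.randn(2, 128)
y = film(x, q)  # same shape as x
```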

Learning to Compose Dynamic Tree Structures for Visual Contexts

KaihuaTang/Scene-Graph-Benchmark.pytorch CVPR 2019

We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A.

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

salesforce/lavis 28 Jan 2022

Performance improvements have largely been achieved by scaling up the dataset with noisy image-text pairs collected from the web, which are a suboptimal source of supervision.

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

ethanjperez/film CVPR 2017

When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings.

Inferring and Executing Programs for Visual Reasoning

facebookresearch/clevr-iep ICCV 2017

Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes.
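As a rough illustration of the program-execution alternative, a hypothetical sketch that chains small neural modules over image features according to a predicted program; the module names and wiring are invented for illustration and are not taken from facebookresearch/clevr-iep:

```python
import torch
import torch.nn as nn

class ModuleExecutor(nn.Module):
    """Execute a predicted program by chaining small neural modules over image
    features, instead of mapping question+image to answer with one black box."""
    def __init__(self, feat_channels: int = 128, num_answers: int = 28):
        super().__init__()
        def block():
            return nn.Sequential(
                nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
                nn.ReLU(),
            )
        # One small module per primitive function (hypothetical vocabulary).
        self.modules_by_name = nn.ModuleDict({
            "filter_red": block(), "filter_cube": block(), "count": block(),
        })
        self.classifier = nn.Linear(feat_channels, num_answers)

    def forward(self, image_feats: torch.Tensor, program: list) -> torch.Tensor:
        h = image_feats                     # (batch, channels, H, W)
        for fn_name in program:             # run modules in program order
            h = self.modules_by_name[fn_name](h)
        pooled = h.mean(dim=(2, 3))         # global average pool
        return self.classifier(pooled)      # answer logits

executor = ModuleExecutor()
feats = torch.randn(1, 128, 14, 14)
logits = executor(feats, ["filter_red", "filter_cube", "count"])
```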

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

stanfordnlp/mac-network CVPR 2019

We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets.

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

dandelin/vilt 5 Feb 2021

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

salesforce/lavis NeurIPS 2021

Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.
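A minimal sketch of the joint multimodal-encoder pattern this sentence describes, assuming pre-extracted visual tokens and word embeddings; the dimensions and layer counts are illustrative, not ALBEF's actual configuration:

```python
import torch
import torch.nn as nn

dim = 256
encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
multimodal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

visual_tokens = torch.randn(2, 36, dim)   # e.g. 36 region features per image
word_tokens = torch.randn(2, 20, dim)     # e.g. 20 word-piece embeddings

# Concatenate both modalities into one token sequence; self-attention then
# models cross-modal interactions between regions and words.
joint = torch.cat([visual_tokens, word_tokens], dim=1)
fused = multimodal_encoder(joint)
```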

CoCa: Contrastive Captioners are Image-Text Foundation Models

mlfoundations/open_clip 4 May 2022

We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs, which predict text tokens autoregressively.
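A minimal sketch of that combined objective with stand-in tensors; the temperature and equal loss weighting are assumptions, not CoCa's published settings:

```python
import torch
import torch.nn.functional as F

batch, dim, seq_len, vocab = 8, 512, 20, 32000

image_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # unimodal image embedding
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)    # unimodal text embedding
caption_logits = torch.randn(batch, seq_len, vocab)        # multimodal decoder outputs
caption_targets = torch.randint(0, vocab, (batch, seq_len))

# Contrastive loss: matched image-text pairs lie on the diagonal.
temperature = 0.07
logits = image_emb @ text_emb.t() / temperature
labels = torch.arange(batch)
contrastive = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Captioning loss: predict each text token autoregressively.
captioning = F.cross_entropy(caption_logits.reshape(-1, vocab), caption_targets.reshape(-1))

loss = contrastive + captioning  # relative weighting omitted in this sketch
```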

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

vision-cair/minigpt-4 20 Apr 2023

Our work, for the first time, uncovers that properly aligning visual features with an advanced large language model can unlock numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts.
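A minimal sketch of the alignment idea: project features from a frozen vision encoder into the language model's embedding space and prepend them to the text embeddings. The shapes and module names here are assumptions for illustration, not the MiniGPT-4 code:

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 768, 4096
proj = nn.Linear(vision_dim, llm_dim)          # the only trainable component in this sketch

visual_feats = torch.randn(1, 32, vision_dim)  # e.g. 32 tokens from a frozen vision encoder
visual_embeds = proj(visual_feats)             # now live in the LLM's embedding space

text_embeds = torch.randn(1, 16, llm_dim)      # embedded prompt tokens
llm_inputs = torch.cat([visual_embeds, text_embeds], dim=1)  # fed to the frozen LLM
```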