Visual Reasoning

215 papers with code • 12 benchmarks • 41 datasets

The ability to understand the actions depicted in visual images and to reason about them.

Benchmarks

These leaderboards track progress in Visual Reasoning.

Dataset	Best Model
Winoground	GPT-4V (CoT, pick b/w two options)
NLVR2 Dev	BEiT-3
NLVR2 Test	BEiT-3
WinoGAViL	Humans
Bongard-OpenWorld	Human
VSR	LXMERT
PHYRE-1B-Within	RPIN
PHYRE-1B-Cross	RPIN
VASR	Swin
NLVR	VisualBERT
IRFL: Image Recognition of Figurative Language	Humans
CLEVRER	AI Core

Libraries

Use these libraries to find Visual Reasoning models and implementations

huggingface/transformers • 5 papers • 125,940 stars
facebookresearch/multimodal • 4 papers • 1,311 stars
salesforce/lavis • 3 papers • 8,848 stars
kakao/DAFT • 3 papers

See all 7 libraries.

Datasets

Subtasks

Visual Commonsense Reasoning

Most implemented papers

Learning Transferable Visual Models From Natural Language Supervision

openai/CLIP • 26 Feb 2021

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis • 30 Jan 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task • NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

fartashf/vsepp • 18 Jul 2017

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval.

Compositional Attention Networks for Machine Reasoning

stanfordnlp/mac-network • ICLR 2018

We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning.

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

airsplay/lxmert • IJCNLP 2019

In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.

Visual Instruction Tuning

haotian-liu/LLaVA • NeurIPS 2023

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.

VisualBERT: A Simple and Performant Baseline for Vision and Language

uclanlp/visualbert • 9 Aug 2019

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks.

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER • ECCV 2020

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
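
The conditional masking idea in the UNITER summary above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `conditional_mask`, the `[MASK]` placeholder, the 50/50 choice of modality, and the 0.15 masking probability are all assumptions made for illustration. The only property taken from the summary is that one modality is masked while the other stays fully observed.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked token or region


def conditional_mask(text_tokens, region_ids, rng, mask_prob=0.15):
    """Mask items in exactly one modality; return the other unchanged.

    Assumption: the modality to corrupt is picked with equal probability,
    and each item in it is masked independently with `mask_prob`.
    """
    mask_text = rng.random() < 0.5  # pick which modality to corrupt this step
    target = text_tokens if mask_text else region_ids
    masked = [MASK if rng.random() < mask_prob else item for item in target]
    if mask_text:
        # Text is (partially) masked, image regions are fully observed.
        return masked, list(region_ids)
    # Image regions are (partially) masked, text is fully observed.
    return list(text_tokens), masked


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["a", "dog", "on", "the", "grass"]
    regions = ["region_0", "region_1", "region_2"]
    print(conditional_mask(tokens, regions, rng, mask_prob=0.5))
```

Under joint random masking both lists could be corrupted in the same step; here at least one of the two returned lists always equals its input, which is the condition the summary describes.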

VinVL: Revisiting Visual Representations in Vision-Language Models

pzzhang/VinVL • CVPR 2021

In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR (Li et al., 2020), and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
