In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods.
This work addresses the challenge of learning and reasoning over language and vision data for related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR).
We present Language-binding Object Graph Network, the first neural reasoning method with dynamic relational structures across both visual and textual domains, with applications in visual question answering.
To endow machine intelligence with this crucial cognitive ability, we propose a dataset, Machine Number Sense (MNS), consisting of visual arithmetic problems automatically generated with a grammar model, the And-Or Graph (AOG).
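To make the grammar-based generation concrete, below is a minimal, hypothetical sketch of sampling arithmetic problems from a toy And-Or grammar: an And-node expands all of its children in order, while an Or-node picks exactly one. The grammar rules, symbol names, and depth limit are illustrative assumptions, not the actual MNS generator.

```python
import random

# Toy And-Or grammar for arithmetic expressions (illustrative only; not
# the actual MNS generator). An And-node expands all of its children in
# order; an Or-node picks exactly one child at random.
GRAMMAR = {
    "Expr": ("or", ["Number", "BinOp"]),       # Or-node: choose a branch
    "BinOp": ("and", ["Expr", "Op", "Expr"]),  # And-node: expand every part
    "Op": ("or", ["+", "-"]),
    "Number": ("or", [str(n) for n in range(10)]),
}

def sample(symbol: str, depth: int = 0, max_depth: int = 3) -> str:
    """Recursively sample an expression string from the toy And-Or grammar."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol, e.g. "+" or "7"
    kind, children = GRAMMAR[symbol]
    if kind == "or":
        # Force a terminal near the depth limit so problems stay small.
        if depth >= max_depth and symbol == "Expr":
            return sample("Number", depth + 1, max_depth)
        return sample(random.choice(children), depth + 1, max_depth)
    # And-node: concatenate the expansion of every child.
    return " ".join(sample(c, depth + 1, max_depth) for c in children)

if __name__ == "__main__":
    problem = sample("Expr")
    print(problem, "=", eval(problem))  # e.g. "3 + 5 = 8"
```

In the dataset itself, such symbolic expressions would additionally be rendered as images; this sketch only shows the grammar-sampling step.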
In this paper, we present an approach and a benchmark for visual reasoning in robotics applications, in particular small-object grasping and manipulation.
We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as in most recent vision-and-language work.
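As a toy illustration of the pixel-level (rather than region-based) alignment this entry contrasts, here is a hypothetical sketch in which every word attends over a grid of pixel embeddings to produce an image-sentence score; the function, tensor shapes, and temperature are assumptions for illustration, not the paper's actual model.

```python
import torch
import torch.nn.functional as F

def word_to_pixel_alignment(pixel_feats, word_feats, temperature=0.1):
    """Hypothetical sketch of pixel-level image-text alignment.

    pixel_feats: (H*W, D) L2-normalized grid/pixel embeddings of one image.
    word_feats:  (T, D)   L2-normalized word embeddings of one sentence.
    Returns a scalar similarity built from word-to-pixel attention,
    rather than from pre-detected region features.
    """
    sim = word_feats @ pixel_feats.t()             # (T, H*W) cosine sims
    attn = F.softmax(sim / temperature, dim=1)     # each word attends to pixels
    attended = attn @ pixel_feats                  # (T, D) pixel context per word
    word_scores = F.cosine_similarity(word_feats, attended, dim=1)  # (T,)
    return word_scores.mean()                      # sentence-level score

if __name__ == "__main__":
    pixels = F.normalize(torch.randn(49, 256), dim=1)  # 7x7 grid, D=256
    words = F.normalize(torch.randn(6, 256), dim=1)    # 6-word sentence
    print(word_to_pixel_alignment(pixels, words).item())
```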
We address these challenges using interpretable deep visual representations for rope, extending recent work on dense object descriptors for robot manipulation.
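For context on what training dense object descriptors typically involves, here is a minimal sketch of the pixelwise contrastive loss used in that line of work. The tensor shapes and margin value are illustrative assumptions, and pixel correspondences are simply taken as given (in practice they usually come from registered RGB-D views of the scene).

```python
import torch
import torch.nn.functional as F

def pixelwise_contrastive_loss(desc_a, desc_b, matches_a, matches_b,
                               non_matches_a, non_matches_b, margin=0.5):
    """Pixelwise contrastive loss for dense descriptors (illustrative sketch).

    desc_a, desc_b: (D, H, W) descriptor images from two views of the scene.
    matches_*:      (N, 2) long tensors of (row, col) corresponding pixels.
    non_matches_*:  (M, 2) long tensors of known non-correspondences.
    """
    def gather(desc, coords):
        # Pull out the descriptor vector at each (row, col) coordinate.
        return desc[:, coords[:, 0], coords[:, 1]].t()  # (N, D)

    d_match = gather(desc_a, matches_a) - gather(desc_b, matches_b)
    d_non = gather(desc_a, non_matches_a) - gather(desc_b, non_matches_b)

    # Pull matching pixels together; push non-matches at least `margin` apart.
    match_loss = (d_match ** 2).sum(dim=1).mean()
    non_match_loss = F.relu(margin - d_non.norm(dim=1)).pow(2).mean()
    return match_loss + non_match_loss
```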
To bridge the gap, we propose a new dataset for visual reasoning in the context of referring expression comprehension with two main features.
Abstract reasoning refers to the ability to analyze information, discover rules at an intangible level, and solve problems in innovative ways.