Visual Reasoning

212 papers with code • 12 benchmarks • 41 datasets

The ability to understand actions, and to reason about them, in arbitrary visual images.


Neural networks for abstraction and reasoning: Towards broad generalization in machines

mxbi/arckit 5 Feb 2024

We present the Perceptual Abstraction and Reasoning Language (PeARL), which allows DreamCoder to solve ARC tasks, and propose a new recognition model that significantly improves on the previous best implementation. We also propose a new encoding and augmentation scheme that allows large language models (LLMs) to solve ARC tasks, and find that the largest models can solve some ARC tasks.
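The paper's actual encoding and augmentation scheme is not reproduced here; as a purely illustrative sketch (function names and format are assumptions, not the paper's method), an ARC grid can be serialized as one digit string per row for an LLM prompt, and augmented by geometric transforms such as rotation:

```python
# Illustrative sketch only: the paper's real encoding/augmentation scheme
# may differ. ARC grids are small 2-D arrays of colour indices (0-9).

def encode_grid(grid):
    """Serialize an ARC grid as one digit string per row."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def rotate90(grid):
    """Rotate a grid 90 degrees clockwise -- a simple augmentation."""
    return [list(row) for row in zip(*grid[::-1])]

example = [[0, 1], [2, 3]]
print(encode_grid(example))            # 01 / 23
print(encode_grid(rotate90(example)))  # 20 / 31
```

Rotations, reflections, and colour permutations are natural augmentations for ARC because task semantics are invariant under them.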


Prompting Large Vision-Language Models for Compositional Reasoning

tossowski/keycomp 20 Jan 2024

Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space.
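Retrieval in a shared embedding space reduces to cosine similarity between L2-normalized text and image embeddings. A toy NumPy sketch (the vectors below are random stand-ins, not real CLIP outputs):

```python
import numpy as np

# Toy sketch of retrieval in a CLIP-style shared embedding space.
# Real embeddings come from a vision-language encoder; random vectors
# stand in for them here.

rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5, 512))   # 5 "image" embeddings
text_emb = rng.normal(size=(512,))       # 1 "query caption" embedding

# L2-normalize so the dot product equals cosine similarity
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb)

scores = image_embs @ text_emb           # cosine similarities
best = int(np.argmax(scores))            # index of best-matching image
print(best, scores[best])
```

With a real model, `image_embs` and `text_emb` would come from the image and text encoders respectively; the retrieval step itself is unchanged.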


Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually

secureaiautonomylab/conditionalvlm 19 Jan 2024

This process involves addressing two key problems: (1) obfuscating an unsafe image demands that the platform provide an accurate rationale grounded in attributes specific to that unsafe image, and (2) the unsafe regions of the image must be minimally obfuscated while the safe regions are still depicted.
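Minimal obfuscation can be sketched as masking only a bounding box while leaving the rest of the image intact (the box coordinates and blank-out masking here are illustrative assumptions, not the paper's counterfactual method):

```python
import numpy as np

# Illustrative sketch: obfuscate only a bounding box, leaving safe
# regions untouched. The paper's counterfactual approach is more involved.

def obfuscate_region(image, y0, y1, x0, x1):
    """Return a copy of `image` with the box [y0:y1, x0:x1] blanked out."""
    out = image.copy()
    out[y0:y1, x0:x1] = 0
    return out

img = np.full((8, 8), 255, dtype=np.uint8)      # all-white test image
masked = obfuscate_region(img, 2, 4, 2, 4)      # hide a 2x2 region
assert masked[3, 3] == 0 and masked[0, 0] == 255
```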


VCoder: Versatile Vision Encoders for Multimodal Large Language Models

shi-labs/vcoder 21 Dec 2023

Secondly, we leverage the images from COCO and outputs from off-the-shelf vision perception models to create our COCO Segmentation Text (COST) dataset for training and evaluating MLLMs on the object perception task.


A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

bradyfu/awesome-multimodal-large-language-models 19 Dec 2023

They endow Large Language Models (LLMs) with powerful capabilities in visual understanding, enabling them to tackle diverse multi-modal tasks.


One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems

mikomel/sal 15 Dec 2023

With the aim of developing universal learning systems in the AVR domain, we propose the unified model for solving Single-Choice Abstract visual Reasoning tasks (SCAR), capable of solving various single-choice AVR tasks, without making any a priori assumptions about the task structure, in particular the number and location of panels.


BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

aifeg/benchlmm 5 Dec 2023

Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles.


X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning

artemisp/lavis-xinstructblip 30 Nov 2023

Vision-language pre-training and instruction tuning have demonstrated general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs).


MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

01-ai/yi 27 Nov 2023

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning.


How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

ucsc-vlaa/vllm-safety-benchmark 27 Nov 2023

Different from prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness.
