TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Multimodal Reasoning	REBUS	GPT-4V	Accuracy	24.0	# 1
Multimodal Reasoning	REBUS	Gemini Pro	Accuracy	13.2	# 2
Multimodal Reasoning	REBUS	LLaVa-1.5-13B	Accuracy	1.8	# 3
Multimodal Reasoning	REBUS	LLaVa-1.5-7B	Accuracy	1.5	# 4
Multimodal Reasoning	REBUS	BLIP2-FLAN-T5-XXL	Accuracy	0.9	# 5
Multimodal Reasoning	REBUS	CogVLM	Accuracy	0.9	# 5
Multimodal Reasoning	REBUS	QWEN	Accuracy	0.9	# 5
Multimodal Reasoning	REBUS	InstructBLIP	Accuracy	0.6	# 8

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rebus-a-robust-evaluation-benchmark-of-1/multimodal-reasoning-on-rebus)](https://paperswithcode.com/sota/multimodal-reasoning-on-rebus?p=rebus-a-robust-evaluation-benchmark-of-1)`

REBUS: A Robust Evaluation Benchmark of Understanding Symbols

11 Jan 2024 · Andrew Gritsevskiy, Arjun Panickssery, Aaron Kirtland, Derik Kauffman, Hans Gundlach, Irina Gritsevskaya, Joe Cavanagh, Jonathan Chiang, Lydia La Roux, Michelle Hung ·

We propose a new benchmark evaluating the performance of multimodal large language models on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories such as movies, composers, major cities, and food. To achieve good performance on the benchmark of identifying the clued word or phrase, models must combine image recognition and string manipulation with hypothesis testing, multi-step reasoning, and an understanding of human cognition, making for a complex, multimodal evaluation of capabilities. We find that proprietary models such as GPT-4V and Gemini Pro significantly outperform all other tested models. However, even the best model has a final accuracy of just 24%, highlighting the need for substantial improvements in reasoning. Further, models rarely understand all parts of a puzzle, and are almost always incapable of retroactively explaining the correct answer. Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal large language models.

PDF Abstract

Code

Add Remove Mark official

cvndsh/rebus official

Tasks

Add Remove

Datasets

REBUS

Results from the Paper

Add Remove

Ranked #1 on Multimodal Reasoning on REBUS

Get a GitHub badge

Task	Dataset	Model	Metric Name	Metric Value	Global Rank	Benchmark
Multimodal Reasoning	REBUS	GPT-4V	Accuracy	24.0	# 1	Compare
Multimodal Reasoning	REBUS	Gemini Pro	Accuracy	13.2	# 2	Compare
Multimodal Reasoning	REBUS	LLaVa-1.5-13B	Accuracy	1.8	# 3	Compare
Multimodal Reasoning	REBUS	LLaVa-1.5-7B	Accuracy	1.5	# 4	Compare
Multimodal Reasoning	REBUS	BLIP2-FLAN-T5-XXL	Accuracy	0.9	# 5	Compare
Multimodal Reasoning	REBUS	CogVLM	Accuracy	0.9	# 5	Compare
Multimodal Reasoning	REBUS	QWEN	Accuracy	0.9	# 5	Compare
Multimodal Reasoning	REBUS	InstructBLIP	Accuracy	0.6	# 8	Compare

Methods

Add Remove

No methods listed for this paper. Add relevant methods here

Edit Social Preview

REBUS: A Robust Evaluation Benchmark of Understanding Symbols

Code Edit Add Remove Mark official

Tasks Edit Add Remove

Datasets Edit

Results from the Paper Edit Add Remove

Methods Edit Add Remove

Code

Add Remove Mark official

Tasks

Add Remove

Datasets

Results from the Paper

Add Remove

Methods

Add Remove