MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interest and its wide applicability in scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models to multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/
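
The overall paradigm can be illustrated with a short sketch: the language model reasons in text, emits an "action" that names a vision expert and an image file name, and a controller executes that expert and feeds the textual result back into the prompt as an observation before the model continues. The Python below is a minimal, hypothetical illustration of this loop, not the authors' implementation; the stubbed functions (`llm`, `image_caption`, `ocr`, `mm_react`) are placeholder names, and a real system would call the ChatGPT API and actual vision services in their place.

```python
# Minimal sketch of an MM-REACT-style controller loop (hypothetical names).
# The LLM never sees pixels: images enter the prompt as file names, and the
# vision experts return text (captions, OCR, tags) that is appended back into
# the prompt as an "Observation" before the LLM resumes reasoning.

import re
from typing import Callable, Dict

# --- stand-in vision experts; replace with real captioning/OCR services ---
def image_caption(path: str) -> str:
    return f"a caption produced by a captioning expert for {path}"

def ocr(path: str) -> str:
    return f"text read from {path} by an OCR expert"

EXPERTS: Dict[str, Callable[[str], str]] = {
    "image_caption": image_caption,
    "ocr": ocr,
}

# --- stand-in LLM; a real system would call the ChatGPT API here ---
def llm(prompt: str) -> str:
    if "Observation:" not in prompt:
        return "Thought: I need to inspect the image.\nAction: image_caption(photo.jpg)"
    return "Final Answer: based on the expert's output, ..."

ACTION_RE = re.compile(r"Action:\s*(\w+)\((.+?)\)")

def mm_react(user_input: str, max_steps: int = 5) -> str:
    """Alternate LLM reasoning and vision-expert actions until a final answer."""
    prompt = f"User: {user_input}\n"
    reply = ""
    for _ in range(max_steps):
        reply = llm(prompt)
        match = ACTION_RE.search(reply)
        if match is None:                       # no tool call -> final answer
            return reply
        name, arg = match.group(1), match.group(2).strip()
        observation = EXPERTS[name](arg)        # invoke the chosen vision expert
        prompt += f"{reply}\nObservation: {observation}\n"
    return reply

if __name__ == "__main__":
    print(mm_react("What does the sign in photo.jpg say?"))
```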


Datasets

MM-Vet

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Visual Question Answering | MM-Vet | MM-ReAct-GPT-4 | GPT-4 score | 44.6±0.2 | #22 |
| Visual Question Answering | MM-Vet | MM-ReAct-GPT-3.5 | GPT-4 score | 27.9±0.1 | #78 |
