mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction-following abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities, with little attention to how modality collaboration can also benefit text-only tasks. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing to both text tasks and multi-modal tasks, achieving state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.
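To make the idea of shared functional modules plus a modality-adaptive module concrete, below is a minimal PyTorch sketch (not the authors' released code) of an attention layer in this spirit. It assumes modality-specific layer normalization and key/value projections alongside a shared query and output projection; the class name, the 0/1 text/vision indexing via `modality_ids`, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a modality-adaptive attention layer: shared query/output
# projections, modality-specific norms and key/value projections.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAdaptiveAttention(nn.Module):
    def __init__(self, dim: int = 4096, num_heads: int = 32):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Shared functional modules, used by both modalities.
        self.q_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Modality-specific modules (index 0 = text, 1 = vision) that
        # preserve modality-specific features.
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(2)])
        self.k_projs = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])
        self.v_projs = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_ids: (batch, seq), 0 for text, 1 for vision.
        b, s, d = x.shape
        h = torch.zeros_like(x)
        k = torch.zeros_like(x)
        v = torch.zeros_like(x)
        for m in range(2):
            mask = (modality_ids == m).unsqueeze(-1)           # (b, s, 1)
            normed = self.norms[m](x)
            h = torch.where(mask, normed, h)                   # modality-specific norm
            k = torch.where(mask, self.k_projs[m](normed), k)  # modality-specific keys
            v = torch.where(mask, self.v_projs[m](normed), v)  # modality-specific values
        q = self.q_proj(h)                                     # shared query projection

        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

        attn = F.scaled_dot_product_attention(split(q), split(k), split(v))
        attn = attn.transpose(1, 2).reshape(b, s, d)
        return self.out_proj(attn)
```

The intent of this design is that visual and textual tokens flow through the same decoder (collaboration via shared projections) while their distributional differences are handled by separate normalization and key/value parameters, so text-only performance is not degraded by mixing in visual features.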

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering (VQA) | InfiMM-Eval | mPLUG-Owl2 | Overall score | 20.05 | #11 |
| Visual Question Answering (VQA) | InfiMM-Eval | mPLUG-Owl2 | Deductive | 23.43 | #10 |
| Visual Question Answering (VQA) | InfiMM-Eval | mPLUG-Owl2 | Abductive | 20.6 | #11 |
| Visual Question Answering (VQA) | InfiMM-Eval | mPLUG-Owl2 | Analogical | 7.64 | #11 |
| Visual Question Answering (VQA) | InfiMM-Eval | mPLUG-Owl2 | Params | 7B | #1 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 1 Image, 2*2 Stitching, Exact Accuracy | 1.9 | #10 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 1 Image, 4*4 Stitching, Exact Accuracy | 0.3 | #10 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 1 Image, 8*8 Stitching, Exact Accuracy | 0.7 | #9 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 10 Images, 1*1 Stitching, Exact Accuracy | 0.4 | #6 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 10 Images, 2*2 Stitching, Exact Accuracy | 0.1 | #6 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 10 Images, 4*4 Stitching, Exact Accuracy | 0 | #6 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 10 Images, 8*8 Stitching, Exact Accuracy | 0 | #3 |
| Visual Question Answering | MM-Vet | mPLUG-Owl2 | GPT-4 score | 36.3±0.1 | #145 |
| Visual Question Answering | MM-Vet | mPLUG-Owl2 | Params | 7B | #1 |
