mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with a modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to the predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement. Different modules can be flexibly selected for understanding and generation tasks across all modalities, including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 achieves new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video captioning tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.
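To make the modular idea above concrete, here is a minimal, hypothetical PyTorch sketch of a multi-module composition network: disentangled per-modality encoders feed a shared universal module, and each task composes only the modules it needs. All class names and layer choices below (TextModule, VisionModule, UniversalModule, MPlug2Sketch) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

DIM = 256  # hidden size chosen arbitrarily for this sketch


class TextModule(nn.Module):
    """Disentangled text encoder (placeholder: embedding + one Transformer layer)."""
    def __init__(self, vocab_size: int = 30522):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, DIM)
        self.encoder = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.encoder(self.embed(token_ids))


class VisionModule(nn.Module):
    """Disentangled image/video encoder (placeholder: patch projection + one layer)."""
    def __init__(self, patch_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, DIM)
        self.encoder = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.encoder(self.proj(patches))


class UniversalModule(nn.Module):
    """Shared module applied to every modality, enabling modality collaboration."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)


class MPlug2Sketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.text = TextModule()        # modality-specific, kept separate
        self.vision = VisionModule()    # modality-specific, kept separate
        self.universal = UniversalModule()  # one shared module for all modalities

    def forward(self, token_ids=None, patches=None):
        """Compose only the modules a task needs; unused modalities stay off."""
        outputs = {}
        if token_ids is not None:
            outputs["text"] = self.universal(self.text(token_ids))
        if patches is not None:
            outputs["vision"] = self.universal(self.vision(patches))
        return outputs


model = MPlug2Sketch()
# A text-only task exercises just the text path...
text_only = model(token_ids=torch.randint(0, 30522, (2, 16)))
# ...while a VQA-style task routes both modalities through the shared module.
multimodal = model(token_ids=torch.randint(0, 30522, (2, 16)),
                   patches=torch.randn(2, 49, 768))
print(text_only["text"].shape, multimodal["vision"].shape)
```

In this sketch, the per-modality encoders capture what is specific to each input type, while the shared universal module is where cross-modal collaboration happens; this is the collaboration-versus-entanglement trade-off the abstract describes.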


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Video Retrieval | DiDeMo | mPLUG-2 | text-to-video R@1 | 45.7 | #6 |
| | | | text-to-video R@5 | 71.1 | #6 |
| | | | text-to-video R@10 | 79.2 | #5 |
| Video Retrieval | DiDeMo | mPLUG-2 | text-to-video R@1 | 56.4 | #14 |
| | | | text-to-video R@5 | 79.1 | #17 |
| | | | text-to-video R@10 | 85.2 | #19 |
| Image Classification | ImageNet | mPLUG-2 | Top-1 Accuracy | 88.5% | #50 |
| Action Classification | Kinetics-400 | mPLUG-2 | Acc@1 | 87.1 | #35 |
| | | | Acc@5 | 97.7 | #16 |
| Action Classification | Kinetics-600 | mPLUG-2 | Top-1 Accuracy | 89.8 | #12 |
| | | | Top-5 Accuracy | 98.3 | #7 |
| Action Classification | Kinetics-700 | mPLUG-2 | Top-1 Accuracy | 80.4 | #12 |
| | | | Top-5 Accuracy | 94.9 | #6 |
| Video Retrieval | LSMDC | mPLUG-2 | text-to-video R@1 | 34.4 | #6 |
| | | | text-to-video R@5 | 55.2 | #5 |
| | | | text-to-video R@10 | 65.1 | #4 |
| Zero-Shot Video Retrieval | LSMDC | mPLUG-2 | text-to-video R@1 | 24.1 | #4 |
| | | | text-to-video R@5 | 43.8 | #3 |
| | | | text-to-video R@10 | 52.0 | #3 |
| Video Captioning | MSR-VTT | mPLUG-2 | CIDEr | 80.0 | #1 |
| | | | METEOR | 34.9 | #2 |
| | | | ROUGE-L | 70.1 | #1 |
| | | | BLEU-4 | 57.8 | #1 |
| Zero-Shot Video Retrieval | MSR-VTT | mPLUG-2 | text-to-video R@1 | 47.1 | #4 |
| | | | text-to-video R@5 | 69.7 | #4 |
| | | | text-to-video R@10 | 79.0 | #3 |
| Video Retrieval | MSR-VTT-1kA | mPLUG-2 | text-to-video R@1 | 53.1 | #11 |
| | | | text-to-video R@5 | 77.6 | #11 |
| | | | text-to-video R@10 | 84.7 | #14 |
| Visual Question Answering (VQA) | MSRVTT-QA | mPLUG-2 | Accuracy | 0.480 | #3 |
| Video Question Answering | MSRVTT-QA | mPLUG-2 | Accuracy | 48.0 | #6 |
| Video Captioning | MSVD | mPLUG-2 | CIDEr | 165.8 | #5 |
| | | | BLEU-4 | 70.5 | #5 |
| | | | METEOR | 48.4 | #3 |
| | | | ROUGE-L | 85.3 | #3 |
| Visual Question Answering (VQA) | MSVD-QA | mPLUG-2 | Accuracy | 0.581 | #7 |
| Visual Grounding | RefCOCO+ testA | mPLUG-2 | Accuracy (%) | 92.8 | #1 |
| Visual Grounding | RefCOCO+ testB | mPLUG-2 | Accuracy (%) | 86.05 | #1 |
| Visual Grounding | RefCOCO+ val | mPLUG-2 | Accuracy (%) | 90.33 | #1 |
| TGIF-Frame | TGIF-QA | mPLUG-2 | Accuracy | 75.4 | #6 |
| Visual Question Answering (VQA) | VQA v2 test-dev | mPLUG-2 | Accuracy | 81.11 | #9 |
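Most of the retrieval rows above report text-to-video R@K: the percentage of text queries whose ground-truth video is ranked within the top K candidates by similarity. Below is a minimal sketch of that computation, assuming a square similarity matrix where text query i is paired with video i; the helper name recall_at_k is hypothetical and is not tied to the paper's evaluation code.

```python
import torch


def recall_at_k(similarity: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """Text-to-video Recall@K from a [num_texts, num_videos] similarity
    matrix, assuming text i matches video i (identity pairing)."""
    num_texts = similarity.size(0)
    # Sort candidate videos by similarity for each text query (best first).
    ranking = similarity.argsort(dim=1, descending=True)
    # Position of the ground-truth video in each ranked list (0 = top hit).
    gt = torch.arange(num_texts).unsqueeze(1)
    ranks = (ranking == gt).float().argmax(dim=1)
    return {f"R@{k}": 100.0 * (ranks < k).float().mean().item() for k in ks}


# Example: 5 text queries scored against 5 candidate videos.
sim = torch.randn(5, 5)
print(recall_at_k(sim))
```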

Methods


No methods listed for this paper.