mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with a modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to the predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement. Different modules can be flexibly selected for understanding and generation tasks across all modalities, including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 achieves new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video captioning tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.
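To make the modular idea above concrete, here is a minimal, hypothetical PyTorch sketch of a multi-module composition network: disentangled per-modality encoders feed a shared universal module, and each task composes only the modules it needs. All class names and layer choices below (TextModule, VisionModule, UniversalModule, MPlug2Sketch) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

DIM = 256  # hidden size chosen arbitrarily for this sketch


class TextModule(nn.Module):
    """Disentangled text encoder (placeholder: embedding + one Transformer layer)."""
    def __init__(self, vocab_size: int = 30522):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, DIM)
        self.encoder = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.encoder(self.embed(token_ids))


class VisionModule(nn.Module):
    """Disentangled image/video encoder (placeholder: patch projection + one layer)."""
    def __init__(self, patch_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, DIM)
        self.encoder = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.encoder(self.proj(patches))


class UniversalModule(nn.Module):
    """Shared module applied to every modality, enabling modality collaboration."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)


class MPlug2Sketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.text = TextModule()        # modality-specific, kept separate
        self.vision = VisionModule()    # modality-specific, kept separate
        self.universal = UniversalModule()  # one shared module for all modalities

    def forward(self, token_ids=None, patches=None):
        """Compose only the modules a task needs; unused modalities stay off."""
        outputs = {}
        if token_ids is not None:
            outputs["text"] = self.universal(self.text(token_ids))
        if patches is not None:
            outputs["vision"] = self.universal(self.vision(patches))
        return outputs


model = MPlug2Sketch()
# A text-only task exercises just the text path...
text_only = model(token_ids=torch.randint(0, 30522, (2, 16)))
# ...while a VQA-style task routes both modalities through the shared module.
multimodal = model(token_ids=torch.randint(0, 30522, (2, 16)),
                   patches=torch.randn(2, 49, 768))
print(text_only["text"].shape, multimodal["vision"].shape)
```

In this sketch, the per-modality encoders capture what is specific to each input type, while the shared universal module is where cross-modal collaboration happens; this is the collaboration-versus-entanglement trade-off the abstract describes.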


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Video Retrieval | DiDeMo | mPLUG-2 | text-to-video R@1 | 45.7 | #6 |
| | | | text-to-video R@5 | 71.1 | #6 |
| | | | text-to-video R@10 | 79.2 | #5 |
| Video Retrieval | DiDeMo | mPLUG-2 | text-to-video R@1 | 56.4 | #14 |
| | | | text-to-video R@5 | 79.1 | #17 |
| | | | text-to-video R@10 | 85.2 | #19 |
| Image Classification | ImageNet | mPLUG-2 | Top-1 Accuracy | 88.5% | #50 |
| Action Classification | Kinetics-400 | mPLUG-2 | Acc@1 | 87.1 | #35 |
| | | | Acc@5 | 97.7 | #16 |
| Action Classification | Kinetics-600 | mPLUG-2 | Top-1 Accuracy | 89.8 | #12 |
| | | | Top-5 Accuracy | 98.3 | #7 |
| Action Classification | Kinetics-700 | mPLUG-2 | Top-1 Accuracy | 80.4 | #12 |
| | | | Top-5 Accuracy | 94.9 | #6 |
| Video Retrieval | LSMDC | mPLUG-2 | text-to-video R@1 | 34.4 | #6 |
| | | | text-to-video R@5 | 55.2 | #5 |
| | | | text-to-video R@10 | 65.1 | #4 |
| Zero-Shot Video Retrieval | LSMDC | mPLUG-2 | text-to-video R@1 | 24.1 | #4 |
| | | | text-to-video R@5 | 43.8 | #3 |
| | | | text-to-video R@10 | 52.0 | #3 |
| Video Captioning | MSR-VTT | mPLUG-2 | CIDEr | 80.0 | #1 |
| | | | METEOR | 34.9 | #2 |
| | | | ROUGE-L | 70.1 | #1 |
| | | | BLEU-4 | 57.8 | #1 |
| Zero-Shot Video Retrieval | MSR-VTT | mPLUG-2 | text-to-video R@1 | 47.1 | #4 |
| | | | text-to-video R@5 | 69.7 | #4 |
| | | | text-to-video R@10 | 79.0 | #3 |
| Video Retrieval | MSR-VTT-1kA | mPLUG-2 | text-to-video R@1 | 53.1 | #11 |
| | | | text-to-video R@5 | 77.6 | #11 |
| | | | text-to-video R@10 | 84.7 | #14 |
| Visual Question Answering (VQA) | MSRVTT-QA | mPLUG-2 | Accuracy | 0.480 | #3 |
| Video Question Answering | MSRVTT-QA | mPLUG-2 | Accuracy | 48.0 | #6 |
| Video Captioning | MSVD | mPLUG-2 | CIDEr | 165.8 | #5 |
| | | | BLEU-4 | 70.5 | #5 |
| | | | METEOR | 48.4 | #3 |
| | | | ROUGE-L | 85.3 | #3 |
| Visual Question Answering (VQA) | MSVD-QA | mPLUG-2 | Accuracy | 0.581 | #7 |
| Visual Grounding | RefCOCO+ testA | mPLUG-2 | Accuracy (%) | 92.8 | #1 |
| Visual Grounding | RefCOCO+ testB | mPLUG-2 | Accuracy (%) | 86.05 | #1 |
| Visual Grounding | RefCOCO+ val | mPLUG-2 | Accuracy (%) | 90.33 | #1 |
| TGIF-Frame | TGIF-QA | mPLUG-2 | Accuracy | 75.4 | #6 |
| Visual Question Answering (VQA) | VQA v2 test-dev | mPLUG-2 | Accuracy | 81.11 | #9 |
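Most of the retrieval rows above report text-to-video R@K: the percentage of text queries whose ground-truth video is ranked within the top K candidates by similarity. Below is a minimal sketch of that computation, assuming a square similarity matrix where text query i is paired with video i; the helper name recall_at_k is hypothetical and is not tied to the paper's evaluation code.

```python
import torch


def recall_at_k(similarity: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """Text-to-video Recall@K from a [num_texts, num_videos] similarity
    matrix, assuming text i matches video i (identity pairing)."""
    num_texts = similarity.size(0)
    # Sort candidate videos by similarity for each text query (best first).
    ranking = similarity.argsort(dim=1, descending=True)
    # Position of the ground-truth video in each ranked list (0 = top hit).
    gt = torch.arange(num_texts).unsqueeze(1)
    ranks = (ranking == gt).float().argmax(dim=1)
    return {f"R@{k}": 100.0 * (ranks < k).float().mean().item() for k in ks}


# Example: 5 text queries scored against 5 candidate videos.
sim = torch.randn(5, 5)
print(recall_at_k(sim))
```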

Methods


No methods listed for this paper.