mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction-following abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities, with little attention to how modality collaboration can also benefit text-only tasks. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing to both text tasks and multi-modal tasks, achieving state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.
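To make the idea of shared functional modules plus a modality-adaptive module concrete, below is a minimal PyTorch sketch (not the authors' released code) of an attention layer in this spirit. It assumes modality-specific layer normalization and key/value projections alongside a shared query and output projection; the class name, the 0/1 text/vision indexing via `modality_ids`, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a modality-adaptive attention layer: shared query/output
# projections, modality-specific norms and key/value projections.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAdaptiveAttention(nn.Module):
    def __init__(self, dim: int = 4096, num_heads: int = 32):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Shared functional modules, used by both modalities.
        self.q_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Modality-specific modules (index 0 = text, 1 = vision) that
        # preserve modality-specific features.
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(2)])
        self.k_projs = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])
        self.v_projs = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_ids: (batch, seq), 0 for text, 1 for vision.
        b, s, d = x.shape
        h = torch.zeros_like(x)
        k = torch.zeros_like(x)
        v = torch.zeros_like(x)
        for m in range(2):
            mask = (modality_ids == m).unsqueeze(-1)           # (b, s, 1)
            normed = self.norms[m](x)
            h = torch.where(mask, normed, h)                   # modality-specific norm
            k = torch.where(mask, self.k_projs[m](normed), k)  # modality-specific keys
            v = torch.where(mask, self.v_projs[m](normed), v)  # modality-specific values
        q = self.q_proj(h)                                     # shared query projection

        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

        attn = F.scaled_dot_product_attention(split(q), split(k), split(v))
        attn = attn.transpose(1, 2).reshape(b, s, d)
        return self.out_proj(attn)
```

The intent of this design is that visual and textual tokens flow through the same decoder (collaboration via shared projections) while their distributional differences are handled by separate normalization and key/value parameters, so text-only performance is not degraded by mixing in visual features.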

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering (VQA) | InfiMM-Eval | mPLUG-Owl2 | Overall score | 20.05 | #11 |
| Visual Question Answering (VQA) | InfiMM-Eval | mPLUG-Owl2 | Deductive | 23.43 | #10 |
| Visual Question Answering (VQA) | InfiMM-Eval | mPLUG-Owl2 | Abductive | 20.6 | #11 |
| Visual Question Answering (VQA) | InfiMM-Eval | mPLUG-Owl2 | Analogical | 7.64 | #11 |
| Visual Question Answering (VQA) | InfiMM-Eval | mPLUG-Owl2 | Params | 7B | #1 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 1 Image, 2*2 Stitching, Exact Accuracy | 1.9 | #10 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 1 Image, 4*4 Stitching, Exact Accuracy | 0.3 | #10 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 1 Image, 8*8 Stitching, Exact Accuracy | 0.7 | #9 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 10 Images, 1*1 Stitching, Exact Accuracy | 0.4 | #6 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 10 Images, 2*2 Stitching, Exact Accuracy | 0.1 | #6 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 10 Images, 4*4 Stitching, Exact Accuracy | 0 | #6 |
| Long-Context Understanding | MMNeedle | mPLUG-Owl-v2 | 10 Images, 8*8 Stitching, Exact Accuracy | 0 | #3 |
| Visual Question Answering | MM-Vet | mPLUG-Owl2 | GPT-4 score | 36.3±0.1 | #145 |
| Visual Question Answering | MM-Vet | mPLUG-Owl2 | Params | 7B | #1 |
