Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

In this work, we introduce Mini-Gemini, a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs). Although advances in VLMs have enabled basic visual dialog and reasoning, a performance gap persists compared to advanced models such as GPT-4 and Gemini. We aim to narrow this gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects: high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose using an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It achieves leading performance on several zero-shot benchmarks and even surpasses well-developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
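
The abstract does not spell out how high-resolution refinement keeps the visual token count fixed. One way to obtain that property, shown here purely as an illustrative assumption and not as the authors' implementation, is cross-attention in which each low-resolution token acts as a query over the high-resolution features of its aligned image region. The function name `refine_tokens` and the input layout are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_tokens(low_res_tokens, high_res_regions):
    """Refine N low-res tokens with high-res features (hypothetical sketch).

    low_res_tokens:   list of N vectors, each of dimension d.
    high_res_regions: list of N lists; entry i holds the high-res feature
                      vectors (dimension d) of the region aligned with token i.
    Returns N refined vectors, so the token count does not grow.
    """
    refined = []
    for query, region in zip(low_res_tokens, high_res_regions):
        scale = math.sqrt(len(query))
        # Scaled dot-product score between the query and each high-res feature.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in region]
        weights = softmax(scores)
        # Attention-weighted average of the high-res features.
        mined = [
            sum(w * feat[i] for w, feat in zip(weights, region))
            for i in range(len(query))
        ]
        # Residual connection: the output is still one vector per input token.
        refined.append([q + m for q, m in zip(query, mined)])
    return refined
```

However many high-resolution features each region contains, the output holds exactly one vector per low-resolution token, which is the property the abstract describes.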

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Referring Expression Comprehension | ColonINST-v1 (Seen) | MGM-2B (w/o LoRA, w/ extra data) | Intersection over Union | 57.25 | #4 |
| Referring Expression Comprehension | ColonINST-v1 (Seen) | MGM-2B (w/o LoRA, w/o extra data) | Intersection over Union | 39.78 | #11 |
| Referring Expression Generation | ColonINST-v1 (Seen) | MGM-2B (w/o LoRA, w/ extra data) | Accuracy | 98.75 | #4 |
| Referring Expression Generation | ColonINST-v1 (Seen) | MGM-2B (w/o LoRA, w/o extra data) | Accuracy | 98.17 | #6 |
| Image Classification | ColonINST-v1 (Seen) | MGM-2B (w/o LoRA, w/o extra data) | Accuracy | 92.97 | #9 |
| Image Classification | ColonINST-v1 (Seen) | MGM-2B (w/o LoRA, w/ extra data) | Accuracy | 93.24 | #7 |
| Image Classification | ColonINST-v1 (Unseen) | MGM-2B (w/o LoRA, w/ extra data) | Accuracy | 78.69 | #9 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | MGM-2B (w/o LoRA, w/ extra data) | Accuracy | 74.30 | #6 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | MGM-2B (w/o LoRA, w/o extra data) | Accuracy | 69.81 | #14 |
| Image Classification | ColonINST-v1 (Unseen) | MGM-2B (w/o LoRA, w/o extra data) | Accuracy | 78.99 | #7 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | MGM-2B (w/o LoRA, w/ extra data) | Intersection over Union | 25.23 | #10 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | MGM-2B (w/o LoRA, w/o extra data) | Intersection over Union | 16.00 | #13 |
| Visual Question Answering | MM-Vet | Mini-Gemini-HD | GPT-4 score | 59.3 | #40 |
| Visual Question Answering | MM-Vet | Mini-Gemini | GPT-4 score | 53.0 | #50 |
| Visual Question Answering | MM-Vet | Mini-Gemini-HD-BS | GPT-4 score | 60.8 | #35 |
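
For readers unfamiliar with the Intersection over Union metric reported for Referring Expression Comprehension, the following is a minimal sketch of how it is commonly computed for a single pair of axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the function name `box_iou` are assumptions for illustration; the listing does not state the benchmark's exact evaluation protocol or how per-sample scores are aggregated into the percentages shown.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2), with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Guard against degenerate (zero-area) inputs.
    return intersection / union if union > 0 else 0.0
```

A predicted box that matches the ground truth exactly scores 1.0, and a box with no overlap scores 0.0.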

Methods