Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question answering, we equip the Qwen-VL models with grounding and text-reading abilities by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models of similar scale on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and across different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority over existing vision-language chatbots. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.
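The linked repository releases the checkpoints on Hugging Face. Below is a minimal inference sketch, assuming the Qwen/Qwen-VL-Chat checkpoint and the `transformers` library with remote code enabled; the `from_list_format` and `chat` helpers come from the model's custom code rather than core `transformers`, and names may differ across releases.

```python
# Minimal sketch: querying Qwen-VL-Chat for captioning and grounding.
# Assumes the Qwen/Qwen-VL-Chat checkpoint and its custom remote code
# (from_list_format / chat helpers); details may vary by release.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an image with a text instruction; grounded objects are returned
# as <ref>object</ref><box>(x1,y1),(x2,y2)</box> tags in the response text.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},  # local path or URL
    {"text": "Describe the image and locate the dog with a bounding box."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```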
Results from the Paper
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Chart Question Answering | ChartQA | Qwen-VL-Chat | 1:1 Accuracy | 66.3 | # 17 |
| Chart Question Answering | ChartQA | Qwen-VL | 1:1 Accuracy | 65.7 | # 19 |
| Visual Question Answering (VQA) | DocVQA test | Qwen-VL | ANLS | 0.651 | # 31 |
| Visual Question Answering (VQA) | DocVQA test | Qwen-VL-Plus | ANLS | 0.9024 | # 5 |
| Visual Question Answering (VQA) | DocVQA test | Qwen-VL-Chat | ANLS | 0.626 | # 33 |
| Visual Question Answering (VQA) | InfiMM-Eval | Qwen-VL-Chat | Overall score | 37.39 | # 3 |
| Visual Question Answering (VQA) | InfiMM-Eval | Qwen-VL-Chat | Deductive | 37.55 | # 3 |
| Visual Question Answering (VQA) | InfiMM-Eval | Qwen-VL-Chat | Abductive | 44.39 | # 6 |
| Visual Question Answering (VQA) | InfiMM-Eval | Qwen-VL-Chat | Analogical | 30.42 | # 2 |
| Visual Question Answering (VQA) | InfiMM-Eval | Qwen-VL-Chat | Params | 16B | # 1 |
| Visual Question Answering | MM-Vet | Qwen-VL-Plus | GPT-4 score | 61.1±0.2 | # 34 |
| Visual Question Answering | MM-Vet | Qwen-VL-Max | GPT-4 score | 66.6±0.5 | # 17 |
| Visual Question Answering | MM-Vet v2 | Qwen-VL-Max | GPT-4 score | 55.8±0.2 | # 12 |
| MMR total | MRR-Benchmark | Qwen-VL-Plus | Total Column Score | 310 | # 9 |
| MMR total | MRR-Benchmark | Qwen-VL-Max | Total Column Score | 366 | # 7 |
| Natural Language Visual Grounding | ScreenSpot | Qwen-VL | Accuracy (%) | 5.2 | # 17 |
| FS-MEVQA | SME | Qwen-VL-Max | BLEU-4 | 24.30 | # 4 |
| FS-MEVQA | SME | Qwen-VL-Max | METEOR | 23.40 | # 4 |
| FS-MEVQA | SME | Qwen-VL-Max | ROUGE-L | 34.52 | # 4 |
| FS-MEVQA | SME | Qwen-VL-Max | CIDEr | 201.47 | # 4 |
| FS-MEVQA | SME | Qwen-VL-Max | SPICE | 26.13 | # 4 |
| FS-MEVQA | SME | Qwen-VL-Max | Detection | 1.05 | # 4 |
| FS-MEVQA | SME | Qwen-VL-Max | ACC | 40.33 | # 4 |
| FS-MEVQA | SME | Qwen-VL-Max | #Learning Samples (N) | 16 | # 1 |
| Visual Question Answering | ViP-Bench | Qwen-VL-Chat (Visual Prompt) | GPT-4 score (bbox) | 39.2 | # 9 |
| Visual Question Answering | ViP-Bench | Qwen-VL-Chat (Visual Prompt) | GPT-4 score (human) | 41.7 | # 7 |
| Visual Question Answering | ViP-Bench | Qwen-VL-Chat (Coordinates) | GPT-4 score (bbox) | 45.3 | # 6 |
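For reference, the DocVQA rows above report ANLS (Average Normalized Levenshtein Similarity): each prediction is scored against its best-matching reference answer as 1 minus the normalized edit distance, zeroed out when that distance reaches 0.5, and the scores are averaged over questions. The sketch below illustrates this standard scoring rule; the function names are illustrative and not taken from the Qwen-VL codebase.

```python
# Illustrative ANLS computation (standard DocVQA scoring rule); not from
# the Qwen-VL codebase. The 0.5 threshold follows the original DocVQA setup.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(predictions: list[str], references: list[list[str]], tau: float = 0.5) -> float:
    # Per question: best similarity over reference answers, zeroed when the
    # normalized edit distance is at or above tau; then average over questions.
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nld = levenshtein(p, r) / max(len(p), len(r), 1)
            best = max(best, 1.0 - nld if nld < tau else 0.0)
        scores.append(best)
    return sum(scores) / max(len(scores), 1)

print(anls(["qwen-vl"], [["Qwen-VL", "QwenVL"]]))  # -> 1.0
```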