Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity via a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) cleaned multilingual multimodal corpus. Beyond conventional image description and question answering, we equip the Qwen-VL models with grounding and text-reading abilities by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models at similar model scales on a broad range of vision-centric benchmarks (e.g., image captioning, question answering, visual grounding) and across different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority over existing vision-language chatbots. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.
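The grounding ability described above means the model can return bounding boxes inline with its text. As a minimal sketch of consuming such output, the parser below assumes the `<ref>…</ref><box>(x1,y1),(x2,y2)</box>` markup with coordinates normalized to a 0-1000 grid, as used in the released Qwen-VL repository; verify the exact format against the official code before relying on it.

```python
import re

# Assumed grounded-output markup: <ref>phrase</ref><box>(x1,y1),(x2,y2)</box>,
# with coordinates on a 0-1000 normalized grid (per the Qwen-VL repo).
BOX_PATTERN = re.compile(
    r"<ref>(?P<phrase>.*?)</ref>"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)

def parse_grounded_output(text, image_width, image_height):
    """Extract (phrase, pixel-space box) pairs from a grounded response."""
    results = []
    for m in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(m.group(k)) for k in ("x1", "y1", "x2", "y2"))
        # Rescale from the 0-1000 normalized grid to pixel coordinates.
        box = (
            x1 * image_width / 1000,
            y1 * image_height / 1000,
            x2 * image_width / 1000,
            y2 * image_height / 1000,
        )
        results.append((m.group("phrase"), box))
    return results

if __name__ == "__main__":
    reply = "<ref>the dog</ref><box>(221,250),(610,900)</box>"
    print(parse_grounded_output(reply, 500, 400))
```

Keeping the phrase alongside each box makes it straightforward to draw labeled boxes over the input image or to score visual-grounding predictions against ground truth.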

Benchmark results

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Chart Question Answering | ChartQA | Qwen-VL-Chat | 1:1 Accuracy | 66.3 | #17 |
| Chart Question Answering | ChartQA | Qwen-VL | 1:1 Accuracy | 65.7 | #19 |
| Visual Question Answering (VQA) | DocVQA test | Qwen-VL | ANLS | 0.651 | #31 |
| Visual Question Answering (VQA) | DocVQA test | Qwen-VL-Plus | ANLS | 0.9024 | #5 |
| Visual Question Answering (VQA) | DocVQA test | Qwen-VL-Chat | ANLS | 0.626 | #33 |
| Visual Question Answering | InfiMM-Eval | Qwen-VL-Chat | Overall score | 37.39 | #3 |
| | | | Deductive | 37.55 | #3 |
| | | | Abductive | 44.39 | #6 |
| | | | Analogical | 30.42 | #2 |
| | | | Params | 16B | #1 |
| Visual Question Answering | MM-Vet | Qwen-VL-Plus | GPT-4 score | 61.1±0.2 | #34 |
| Visual Question Answering | MM-Vet | Qwen-VL-Max | GPT-4 score | 66.6±0.5 | #17 |
| Visual Question Answering | MM-Vet v2 | Qwen-VL-Max | GPT-4 score | 55.8±0.2 | #12 |
| MMR total | MRR-Benchmark | Qwen-VL-Plus | Total Column Score | 310 | #9 |
| MMR total | MRR-Benchmark | Qwen-VL-Max | Total Column Score | 366 | #7 |
| Natural Language Visual Grounding | ScreenSpot | Qwen-VL | Accuracy (%) | 5.2 | #17 |
| FS-MEVQA | SME | Qwen-VL-Max | BLEU-4 | 24.30 | #4 |
| | | | METEOR | 23.40 | #4 |
| | | | ROUGE-L | 34.52 | #4 |
| | | | CIDEr | 201.47 | #4 |
| | | | SPICE | 26.13 | #4 |
| | | | Detection | 1.05 | #4 |
| | | | ACC | 40.33 | #4 |
| | | | #Learning Samples (N) | 16 | #1 |
| Visual Question Answering | ViP-Bench | Qwen-VL-Chat (Visual Prompt) | GPT-4 score (bbox) | 39.2 | #9 |
| | | | GPT-4 score (human) | 41.7 | #7 |
| Visual Question Answering | ViP-Bench | Qwen-VL-Chat (Coordinates) | GPT-4 score (bbox) | 45.3 | #6 |
