LLaVA-OneVision: Easy Visual Task Transfer

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model to simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision enables strong transfer learning across modalities and scenarios, yielding new emergent capabilities. In particular, strong video understanding and cross-scenario capabilities emerge through task transfer from images to videos.

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering | MM-Vet | LLaVA-OneVision-0.5B | GPT-4 score | 29.1 | #211 |
| Visual Question Answering | MM-Vet | LLaVA-OneVision-7B | GPT-4 score | 57.5 | #43 |
| Visual Question Answering | MM-Vet | LLaVA-OneVision-72B | GPT-4 score | 63.7 | #26 |
| Multiple-choice | Neptune-Full | LLaVA-OneVision (100 frames) | Accuracy (%) | 66.22 | #4 |
| Video Question Answering | NExT-QA | LLaVA-OV (7B) | Accuracy | 79.4 | #15 |
| Video Question Answering | NExT-QA | LLaVA-OV (72B) | Accuracy | 80.2 | #13 |
| 3D Question Answering (3D-QA) | ScanQA (test w/ objects) | LLaVA-NeXT-Video | Exact Match | 18.7 | #16 |
| | | | BLEU-4 | 9.8 | #13 |
| | | | ROUGE | 27.8 | #16 |
| | | | METEOR | 9.1 | #16 |
| | | | CIDEr | 46.2 | #18 |
| 3D Question Answering (3D-QA) | SQA3D | LLaVA-NeXT-Video | Exact Match | 34.2 | #13 |
| Temporal Relation Extraction | Vinoground | LLaVA-OneVision-Qwen2-72B | Text Score | 48.4 | #4 |
| | | | Video Score | 35.2 | #3 |
| | | | Group Score | 21.8 | #3 |
| Temporal Relation Extraction | Vinoground | LLaVA-OneVision-Qwen2-7B | Text Score | 41.6 | #5 |
| | | | Video Score | 29.4 | #6 |
| | | | Group Score | 14.6 | #6 |
| Visual Question Answering (VQA) | VLM2-Bench | LLaVA-OneVision-7B | Average score (9 subtasks) | 39.35 | #7 |
| | | | GC-mat | 16.60 | #8 |
| | | | GC-trk | 13.70 | #8 |
| | | | OC-cpr | 47.22 | #7 |
| | | | OC-cnt | 56.17 | #4 |
| | | | OC-grp | 27.50 | #8 |
| | | | PC-cpr | 62.00 | #3 |
| | | | PC-cnt | 46.67 | #8 |
| | | | PC-grp | 37.00 | #6 |
| | | | PC-VID | 47.25 | #3 |
| Zero-Shot Video Question Answer | VNBench | LLaVA-OneVision-7B | Accuracy | 51.8 | #4 |
| Zero-Shot Video Question Answer | VNBench | LLaVA-OneVision-72B | Accuracy | 58.7 | #3 |