LLaVA-OneVision: Easy Visual Task Transfer
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
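The same checkpoint handles single-image, multi-image, and video prompts through one interface. As a rough illustration, the sketch below runs single-image inference with community-converted Hugging Face checkpoints; the `llava-hf/llava-onevision-qwen2-7b-ov-hf` model id, the `LlavaOnevisionForConditionalGeneration` class, and the chat-template details are assumptions based on recent `transformers` releases, not details taken from the paper itself.

```python
# Minimal single-image inference sketch.
# Assumptions: the "llava-hf" community checkpoint name and the
# LlavaOnevisionForConditionalGeneration class in recent transformers releases.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed 7B checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any local test image; multi-image and video inputs follow the same pattern,
# passing a list of images or sampled video frames to the processor.
image = Image.open("example.jpg")
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```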
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Visual Question Answering | MM-Vet | LLaVA-OneVision-0.5B | GPT-4 score | 29.1 | # 211
Visual Question Answering | MM-Vet | LLaVA-OneVision-7B | GPT-4 score | 57.5 | # 43
Visual Question Answering | MM-Vet | LLaVA-OneVision-72B | GPT-4 score | 63.7 | # 26
Multiple-choice | Neptune-Full | LLaVA-OneVision (100 frames) | Accuracy (%) | 66.22 | # 4
Video Question Answering | NExT-QA | LLaVA-OV (7B) | Accuracy | 79.4 | # 15
Video Question Answering | NExT-QA | LLaVA-OV (72B) | Accuracy | 80.2 | # 13
3D Question Answering (3D-QA) | ScanQA Test w/ objects | LLaVA-NeXT-Video | Exact Match | 18.7 | # 16
3D Question Answering (3D-QA) | ScanQA Test w/ objects | LLaVA-NeXT-Video | BLEU-4 | 9.8 | # 13
3D Question Answering (3D-QA) | ScanQA Test w/ objects | LLaVA-NeXT-Video | ROUGE | 27.8 | # 16
3D Question Answering (3D-QA) | ScanQA Test w/ objects | LLaVA-NeXT-Video | METEOR | 9.1 | # 16
3D Question Answering (3D-QA) | ScanQA Test w/ objects | LLaVA-NeXT-Video | CIDEr | 46.2 | # 18
3D Question Answering (3D-QA) | SQA3D | LLaVA-NeXT-Video | Exact Match | 34.2 | # 13
Temporal Relation Extraction | Vinoground | LLaVA-OneVision-Qwen2-72B | Text Score | 48.4 | # 4
Temporal Relation Extraction | Vinoground | LLaVA-OneVision-Qwen2-72B | Video Score | 35.2 | # 3
Temporal Relation Extraction | Vinoground | LLaVA-OneVision-Qwen2-72B | Group Score | 21.8 | # 3
Temporal Relation Extraction | Vinoground | LLaVA-OneVision-Qwen2-7B | Text Score | 41.6 | # 5
Temporal Relation Extraction | Vinoground | LLaVA-OneVision-Qwen2-7B | Video Score | 29.4 | # 6
Temporal Relation Extraction | Vinoground | LLaVA-OneVision-Qwen2-7B | Group Score | 14.6 | # 6
Visual Question Answering (VQA) | VLM2-Bench | LLaVA-OneVision-7B | Average Score on VLM2-Bench (9 subtasks) | 39.35 | # 7
Visual Question Answering (VQA) | VLM2-Bench | LLaVA-OneVision-7B | GC-mat | 16.60 | # 8
Visual Question Answering (VQA) | VLM2-Bench | LLaVA-OneVision-7B | GC-trk | 13.70 | # 8
Visual Question Answering (VQA) | VLM2-Bench | LLaVA-OneVision-7B | OC-cpr | 47.22 | # 7
Visual Question Answering (VQA) | VLM2-Bench | LLaVA-OneVision-7B | OC-cnt | 56.17 | # 4
Visual Question Answering (VQA) | VLM2-Bench | LLaVA-OneVision-7B | OC-grp | 27.50 | # 8
Visual Question Answering (VQA) | VLM2-Bench | LLaVA-OneVision-7B | PC-cpr | 62.00 | # 3
Visual Question Answering (VQA) | VLM2-Bench | LLaVA-OneVision-7B | PC-cnt | 46.67 | # 8
Visual Question Answering (VQA) | VLM2-Bench | LLaVA-OneVision-7B | PC-grp | 37.00 | # 6
Visual Question Answering (VQA) | VLM2-Bench | LLaVA-OneVision-7B | PC-VID | 47.25 | # 3
Zero-Shot Video Question Answer | VNBench | LLaVA-OneVision-7B | Accuracy | 51.8 | # 4
Zero-Shot Video Question Answer | VNBench | LLaVA-OneVision-72B | Accuracy | 58.7 | # 3