VLM²-Bench is the first comprehensive benchmark designed to evaluate the ability of vision-language models (VLMs) to visually link matching cues across multi-image sequences and videos. The benchmark comprises 9 subtasks with over 3,000 test cases, focusing on fundamental visual linking capabilities that humans use daily, such as identifying the same person across different photos without prior knowledge of their identity.
Through extensive evaluation of eight open-source VLMs and GPT-4o using various prompting techniques, we uncover significant challenges in visual cue linking. Even the best-performing model, GPT-4o, falls 34.80% below human-level performance. Our analysis highlights critical areas for improvement:
1. Enhancing core visual understanding with reduced reliance on prior knowledge.
2. Better integration of language reasoning within visual tasks.
3. Developing training approaches that improve independent visual relationship inference.
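For orientation, here is a minimal sketch of how a benchmark of this kind is typically scored: each test case pairs a multi-image question with a ground-truth answer, and accuracy is averaged per subtask. The file name, JSON schema, and `query_model` function below are hypothetical placeholders rather than the benchmark's actual interface; see the code repository for the official evaluation scripts.

```python
import json
from collections import defaultdict


def query_model(image_paths, question):
    """Hypothetical stand-in for a VLM call (e.g., GPT-4o or an open-source model).

    Expected to return the model's answer to a multi-image question as a string.
    """
    raise NotImplementedError("Plug in your model's multi-image inference here.")


def evaluate(test_cases_path):
    """Score a model on cue-linking test cases, reporting accuracy per subtask.

    Assumes each test case is a dict with hypothetical keys:
    'subtask', 'image_paths', 'question', and 'answer'.
    """
    with open(test_cases_path) as f:
        test_cases = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for case in test_cases:
        prediction = query_model(case["image_paths"], case["question"])
        total[case["subtask"]] += 1
        # Exact-match scoring for simplicity; the official benchmark may use
        # more nuanced answer matching depending on the subtask format.
        if prediction.strip().lower() == case["answer"].strip().lower():
            correct[case["subtask"]] += 1

    return {task: correct[task] / total[task] for task in total}


if __name__ == "__main__":
    scores = evaluate("test_cases.json")  # hypothetical file name
    for task, accuracy in sorted(scores.items()):
        print(f"{task}: {accuracy:.2%}")
```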
📄 Paper: VLM²-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
📂 Code Repository: GitHub - vlm2-bench/VLM2-Bench
@misc{zhang2025vlm2benchcloserlookvlms,
      title={VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues},
      author={Jianshu Zhang and Dongyu Yao and Renjie Pi and Paul Pu Liang and Yi R. Fung},
      year={2025},
      eprint={2502.12084},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12084}
}