VLM²-Bench: Benchmarking Vision-Language Models on Visual Cue Matching

Description

VLM²-Bench is the first comprehensive benchmark designed to evaluate the ability of vision-language models (VLMs) to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases, focusing on fundamental visual linking capabilities that humans use daily, such as identifying the same person across different photos without any prior knowledge of their identity.

Through extensive evaluation of eight open-source VLMs and GPT-4o using various prompting techniques, we uncover significant challenges in visual cue linking: even the best-performing model, GPT-4o, falls 34.80% below human-level performance. Our analysis highlights three critical areas for improvement:

  1. Enhancing core visual understanding while reducing reliance on prior knowledge.
  2. Integrating language-based reasoning more effectively into vision-centric tasks.
  3. Developing training approaches that improve models' ability to independently infer visual relationships.

Dataset Characteristics

  • Size: 3,000+ test cases
  • Modalities: Text, image, video
  • Question Types: True/False, multiple-choice, numerical, open-ended
  • Generation Process: Semi-automated with human verification
  • Structure: Organized into three primary categories (a schematic test-case layout is sketched after this list):
      • General Cue (GC): Evaluates visual element tracking and matching.
      • Object-centric Cue (OC): Focuses on object comparison, counting, and grouping.
      • Person-centric Cue (PC): Measures the ability to compare, count, group, and describe individuals across frames.
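
The sketch below shows one way a single test case could be represented in code, combining the category, subtask, and question-type dimensions listed above. The field names and example values are illustrative assumptions, not the actual schema shipped with the benchmark repository.

```python
# Minimal sketch of a VLM²-Bench-style test case as a plain Python record.
# NOTE: field names and values are illustrative assumptions, not the
# benchmark's actual data schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TestCase:
    case_id: str
    category: str        # "GC", "OC", or "PC"
    subtask: str         # one of the 9 subtasks
    question_type: str   # "tf", "mcq", "numerical", or "open-ended"
    images: List[str] = field(default_factory=list)  # image or video-frame paths
    question: str = ""
    answer: str = ""     # gold answer; format depends on question_type

# Hypothetical example instance
case = TestCase(
    case_id="pc_compare_0001",
    category="PC",
    subtask="comparison",
    question_type="tf",
    images=["frame_01.jpg", "frame_02.jpg"],
    question="Do the highlighted individuals in the two frames refer to the same person?",
    answer="True",
)
```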

Potential Use Cases

  • Benchmarking vision-language models (VLMs) for real-world multi-modal reasoning (see the evaluation sketch after this list).
  • Evaluating visual linking abilities and spatial awareness in large models.
  • Analyzing weaknesses in object permanence and relational inference.
  • Providing insights for improving next-generation vision-language architectures.
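
To make the benchmarking use case concrete, here is a hedged sketch of an evaluation loop that reports exact-match accuracy per category on the closed-form (True/False and multiple-choice) questions. The `model.answer(images, question)` interface and the `TestCase`-style records are assumptions carried over from the sketch above; this is not the official VLM²-Bench scoring code, which also handles numerical and open-ended answers.

```python
# Hedged sketch of a benchmarking loop: exact-match accuracy per category
# on closed-form (True/False and multiple-choice) questions only.
# `model.answer(...)` is a placeholder for whatever VLM interface you use;
# `cases` is an iterable of TestCase-style records as sketched above.
from collections import defaultdict

def evaluate(model, cases):
    correct, total = defaultdict(int), defaultdict(int)
    for case in cases:
        if case.question_type not in {"tf", "mcq"}:
            continue  # numerical/open-ended answers need separate judging
        prediction = model.answer(case.images, case.question)
        total[case.category] += 1
        if prediction.strip().lower() == case.answer.strip().lower():
            correct[case.category] += 1
    # Per-category accuracy, keyed by "GC" / "OC" / "PC"
    return {cat: correct[cat] / total[cat] for cat in total}
```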

Paper & Code

📄 Paper: VLM²-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues (https://arxiv.org/abs/2502.12084)
📂 Code Repository: https://github.com/vlm2-bench/VLM2-Bench

BibTeX Citation

@misc{zhang2025vlm2benchcloserlookvlms,
      title={VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues}, 
      author={Jianshu Zhang and Dongyu Yao and Renjie Pi and Paul Pu Liang and Yi R. Fung},
      year={2025},
      eprint={2502.12084},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12084}
}
