Existing video conversation benchmarks offer limited diversity, so we introduce VCGBench-Diverse to evaluate the generalization ability of video LMMs more comprehensively. While VCG-Bench provides an extensive evaluation protocol, it is limited to videos from the ActivityNet200 dataset. Our benchmark comprises 877 videos spanning 18 broad video categories, with 4,354 QA pairs in total.

The evaluation is computed over five different aspects:

  1. Correctness of information

  2. Detail orientation

  3. Contextual understanding

  4. Temporal understanding

  5. Consistency

Additionally, VCGBench-Diverse provides a breakdown of performance across three key aspects:

  1. Dense video captioning, which assesses the ability to generate detailed and accurate descriptions of the video content.

  2. Spatial understanding, which evaluates the ability to understand and describe the spatial relationships and settings within the video.

  3. Reasoning, which tests the ability to infer and explain causal relationships and actions within the video.
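
To make the two reporting views concrete, the following is a minimal sketch of how per-sample scores could be aggregated into the five evaluation aspects and the three-way breakdown listed above. The record layout (`aspect`, `category`, `score` keys), the label strings, and the `summarize` helper are all hypothetical names chosen for illustration; they are not taken from the benchmark's released tooling, and the scoring scale is left unspecified.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical label strings for the five aspects and three categories.
ASPECTS = ("correctness", "detail", "context", "temporal", "consistency")
CATEGORIES = ("dense_captioning", "spatial_understanding", "reasoning")


def summarize(records):
    """Average scores per aspect and per category.

    `records` is an iterable of dicts with the assumed keys
    'aspect', 'category' and 'score' (a number).
    Labels with no records are omitted from the result.
    """
    by_aspect = defaultdict(list)
    by_category = defaultdict(list)
    for rec in records:
        by_aspect[rec["aspect"]].append(rec["score"])
        by_category[rec["category"]].append(rec["score"])

    aspect_avg = {a: mean(by_aspect[a]) for a in ASPECTS if by_aspect[a]}
    category_avg = {c: mean(by_category[c]) for c in CATEGORIES if by_category[c]}
    return aspect_avg, category_avg


# Toy usage with made-up scores (not real benchmark results).
example = [
    {"aspect": "correctness", "category": "reasoning", "score": 4},
    {"aspect": "correctness", "category": "reasoning", "score": 2},
    {"aspect": "temporal", "category": "dense_captioning", "score": 3},
]
aspect_avg, category_avg = summarize(example)
print(aspect_avg)
print(category_avg)
```

Keeping the two groupings separate mirrors the text: the five aspects describe what is being judged in an answer, while the three categories describe which capability a question targets.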

