VCGBench-Diverse
5 papers with code • 1 benchmark • 1 dataset
Recognizing the limited diversity in existing video conversation benchmarks, we introduce VCGBench-Diverse to comprehensively evaluate the generalization ability of video LMMs. While VCG-Bench provides an extensive evaluation protocol, it is limited to videos from the ActivityNet200 dataset. Our benchmark comprises 877 videos spanning 18 broad video categories, with 4,354 QA pairs, ensuring a robust evaluation framework.
The evaluation is computed over five different aspects:
- Correctness of information
- Detail orientation
- Contextual understanding
- Temporal understanding
- Consistency
Additionally, VCGBench-Diverse provides a breakdown of performance across three key aspects:
- Dense video captioning, which assesses the ability to generate detailed and accurate descriptions of the video content
- Spatial understanding, which evaluates the capability to understand and describe the spatial relationships and settings within the video
- Reasoning, which tests the adeptness in inferring and explaining causal relationships and actions within the video
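As a rough illustration only, the sketch below shows one way the per-aspect scores could be collected and averaged into an overall result. The class and field names are hypothetical and the simple mean is an assumed aggregation; this is not the official VCGBench-Diverse evaluation code.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical container for a model's VCGBench-Diverse scores.
# Field names are illustrative, not taken from the official scripts.
@dataclass
class VCGBenchDiverseResult:
    # Five core evaluation aspects (each assumed to be a GPT-assisted
    # score, e.g. on a 1-5 scale, as in VCG-Bench style evaluation).
    correctness: float
    detail_orientation: float
    contextual_understanding: float
    temporal_understanding: float
    consistency: float
    # Breakdown across the three key aspects.
    dense_captioning: float
    spatial_understanding: float
    reasoning: float

    def overall(self) -> float:
        """Average of the five core aspects (assumed aggregation)."""
        return mean([
            self.correctness,
            self.detail_orientation,
            self.contextual_understanding,
            self.temporal_understanding,
            self.consistency,
        ])


# Example usage with made-up scores.
result = VCGBenchDiverseResult(
    correctness=3.1, detail_orientation=2.9, contextual_understanding=3.5,
    temporal_understanding=2.7, consistency=3.3,
    dense_captioning=3.0, spatial_understanding=3.2, reasoning=3.4,
)
print(f"Overall score: {result.overall():.2f}")
```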
Most implemented papers
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data.
VTimeLLM: Empower LLM to Grasp Video Moments
Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details.
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding.