5 papers with code • 1 benchmark • 1 dataset

Recognizing the limited diversity in existing video conversation benchmarks, we introduce VCGBench-Diverse to comprehensively evaluate the generalization ability of video LMMs. VCG-Bench provides an extensive evaluation protocol, but it is limited to videos from the ActivityNet200 dataset. VCGBench-Diverse comprises 877 videos across 18 broad video categories, with 4,354 QA pairs.

The evaluation is computed over five different aspects:

  1. Correctness of information

  2. Detail orientation

  3. Contextual understanding

  4. Temporal understanding

  5. Consistency

Additionally, VCGBench-Diverse provides a breakdown of performance across three key aspects:

  1. Dense video captioning, which assesses the ability to generate detailed and accurate descriptions of the video content

  2. Spatial understanding, which evaluates the ability to understand and describe the spatial relationships and settings within the video

  3. Reasoning, which tests the ability to infer and explain causal relationships and actions within the video
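As a rough illustration of how the per-aspect results and the per-category breakdown described above might be aggregated, the following Python sketch averages scores by group. The record schema (`aspect`, `category`, `score`), the label strings, and the sample values are all hypothetical and invented for this example; they are not taken from the benchmark's released files or its official evaluation scripts.

```python
from collections import defaultdict
from statistics import mean


def average_by(records, key):
    """Group records by `key` and return the mean score of each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["score"])
    return {name: mean(scores) for name, scores in groups.items()}


# Hypothetical judged answers: the field names and scores are assumptions
# made for illustration, not real benchmark results.
records = [
    {"aspect": "correctness", "category": "reasoning", "score": 4},
    {"aspect": "correctness", "category": "spatial", "score": 2},
    {"aspect": "temporal", "category": "reasoning", "score": 3},
]

per_aspect = average_by(records, "aspect")      # one mean per evaluation aspect
per_category = average_by(records, "category")  # one mean per video-task category

print(per_aspect)
print(per_category)
```

The same helper serves both views because each is a group-by-and-average over the same judged answers; only the grouping key changes.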

Most implemented papers

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

opengvlab/ask-anything CVPR 2024

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

pku-yuangroup/chat-univi CVPR 2024

Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

mbzuai-oryx/video-chatgpt 8 Jun 2023

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data.

VTimeLLM: Empower LLM to Grasp Video Moments

huangb23/vtimellm CVPR 2024

Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details.

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

mbzuai-oryx/videogpt-plus 13 Jun 2024

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding.