Video-MME stands for Video Multi-Modal Evaluation. It is the first-ever comprehensive evaluation benchmark designed specifically for Multi-modal Large Language Models (MLLMs) in video analysis¹. The benchmark is significant because it provides a high-quality assessment of how well MLLMs process sequential visual data, an ability that has been far less explored than their performance on static image understanding.

The Video-MME benchmark is characterized by four features¹:

1. Diversity in video types: six primary visual domains with 30 subfields, chosen for broad scenario generalizability.
2. Duration in the temporal dimension: short-, medium-, and long-term videos ranging from 11 seconds to 1 hour, to assess robustness to varying contextual dynamics.
3. Breadth in data modalities: multi-modal inputs including video frames, subtitles, and audio.
4. Quality in annotations: rigorous manual labeling by expert annotators for precise and reliable model assessment.
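As a concrete illustration of these four dimensions, the sketch below shows what a single annotated entry might look like. The field names and values are hypothetical placeholders, not the benchmark's official schema.

```python
# Hypothetical Video-MME-style entry, illustrating the domain/subfield
# taxonomy, the three duration categories, the additional modalities
# (subtitles, audio), and a manually written multiple-choice question.
# All field names and values are illustrative only.
sample_entry = {
    "video_id": "example_0001",            # placeholder identifier
    "domain": "Knowledge",                 # one of the 6 primary visual domains
    "subfield": "Science Popularization",  # one of the 30 subfields
    "duration_category": "medium",         # "short", "medium", or "long"
    "duration_seconds": 742,
    "modalities": ["frames", "subtitles", "audio"],
    "question": "What experiment does the presenter demonstrate first?",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "B",                         # ground-truth label from expert annotators
}
```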

The benchmark comprises 900 manually selected and annotated videos totaling 256 hours, yielding 2,700 question-answer pairs. It has been used to evaluate various state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, as well as open-source image and video models¹. The findings from Video-MME highlight the need for further improvements in handling longer sequences and multi-modal data, which is crucial for the advancement of MLLMs¹.
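Because the 2,700 questions are multiple-choice, evaluation reduces to comparing a model's predicted option letter with the annotated answer. The sketch below shows a minimal accuracy computation over entries shaped like the hypothetical example above; the `predict` callable is a stand-in for an MLLM, and this is not the benchmark's official evaluation script.

```python
from typing import Callable, Dict, Iterable


def evaluate_accuracy(entries: Iterable[Dict], predict: Callable[[Dict], str]) -> float:
    """Multiple-choice accuracy over Video-MME-style QA entries.

    `predict` stands in for an MLLM that maps one entry (video, optional
    subtitles/audio, question, options) to an option letter such as "A".
    Minimal sketch only; not the benchmark's official evaluation code.
    """
    total = correct = 0
    for entry in entries:
        total += 1
        if predict(entry).strip().upper() == entry["answer"].upper():
            correct += 1
    return correct / total if total else 0.0


# Toy usage: a baseline that always answers "A" on two illustrative entries.
toy_entries = [
    {"question": "...", "options": {"A": "...", "B": "..."}, "answer": "A"},
    {"question": "...", "options": {"A": "...", "B": "..."}, "answer": "B"},
]
print(evaluate_accuracy(toy_entries, lambda e: "A"))  # -> 0.5
```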

(1) Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. arXiv:2405.21075. https://arxiv.org/abs/2405.21075 (DOI: https://doi.org/10.48550/arXiv.2405.21075). (2) Video-MME project page. https://video-mme.github.io/home_page.html. (3) Video-MME: Welcome. https://video-mme.github.io/.
