MVBench
9 papers with code • 0 benchmarks • 0 datasets
Most implemented papers
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
CogVLM2: Visual Language Models for Image and Video Understanding
Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications.
ST-LLM: Large Language Models Are Effective Temporal Learners
In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLM itself?
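A minimal sketch, assuming PyTorch, of what feeding all spatial-temporal tokens to the LLM can look like: per-frame patch tokens are flattened into one long sequence and concatenated with the text embeddings, so the LLM itself performs the video sequence modeling. Shapes, names, and the concatenation order are illustrative assumptions, not the paper's exact implementation.

import torch

def flatten_video_tokens(frame_features: torch.Tensor) -> torch.Tensor:
    """frame_features: (T, N, D) = frames x patch tokens x hidden dim."""
    T, N, D = frame_features.shape
    # Joint spatial-temporal sequence of length T*N, kept in frame order,
    # so the LLM sees every token and models the video sequence directly.
    return frame_features.reshape(T * N, D)

# Example: 8 frames, 196 patch tokens per frame, 4096-dim features.
video_tokens = flatten_video_tokens(torch.randn(8, 196, 4096))
text_tokens = torch.randn(32, 4096)                        # embedded prompt tokens
llm_input = torch.cat([video_tokens, text_tokens], dim=0)  # sequence fed to the LLM
print(llm_input.shape)                                     # torch.Size([1600, 4096])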
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks.
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding.
Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos
We apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM).
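A minimal sketch, assuming PyTorch, of a cross-attention projector between a visual encoder and an LLM: learnable queries cross-attend to the visual tokens, and a causal mask restricts each query to frames up to its own position. The module name, query count, and the exact masking scheme are illustrative assumptions and may differ from the paper's design.

import torch
import torch.nn as nn

class CrossAttnProjector(nn.Module):
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        # Learnable queries that will be handed to the LLM after projection.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens, frame_ids, query_frame_ids):
        """visual_tokens: (B, S, D); frame_ids: (S,); query_frame_ids: (Q,)."""
        B = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Causal cross-attention mask: True = blocked. Query i may only attend
        # to visual tokens whose frame index is <= its own frame index.
        mask = frame_ids.unsqueeze(0) > query_frame_ids.unsqueeze(1)  # (Q, S)
        out, _ = self.attn(q, visual_tokens, visual_tokens, attn_mask=mask)
        return out  # projected tokens passed on to the LLM

# Example: 4 frames x 16 tokens each, 64 queries split evenly across frames.
tokens = torch.randn(2, 64, 1024)
frame_ids = torch.arange(4).repeat_interleave(16)
query_frame_ids = torch.arange(4).repeat_interleave(16)
proj = CrossAttnProjector()
print(proj(tokens, frame_ids, query_frame_ids).shape)  # torch.Size([2, 64, 1024])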
Enhancing Temporal Modeling of Video LLMs via Time Gating
However, most existing Video LLMs neglect temporal information in video data and therefore struggle with temporally-aware video understanding.
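A minimal sketch, assuming PyTorch, of one way to add time gating to per-frame features: a depthwise temporal convolution aggregates neighboring frames, and a learned sigmoid gate controls how much of that temporal context is injected back into each frame. This is an illustrative assumption, not the paper's exact module.

import torch
import torch.nn as nn

class TimeGatedTemporalMixer(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        # Depthwise convolution over the time axis mixes neighboring frames.
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        """x: (B, T, D) per-frame features."""
        temporal_ctx = self.temporal(x.transpose(1, 2)).transpose(1, 2)
        g = torch.sigmoid(self.gate(x))  # per-feature gate in [0, 1]
        return x + g * temporal_ctx      # gated temporal residual

frames = torch.randn(2, 16, 1024)
print(TimeGatedTemporalMixer()(frames).shape)  # torch.Size([2, 16, 1024])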
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
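A minimal sketch, assuming PyTorch, of a thumbnail-and-sampling strategy for building visual tokens without training: downscaled frames are tiled into a single thumbnail image as a global view, while a few full-resolution frames are kept for detail, and both are passed through the image encoder. The grid size, sampling rate, and function name are illustrative assumptions.

import torch
import torch.nn.functional as F

def thumbnail_and_sample(frames: torch.Tensor, grid=(2, 3), num_sampled=2):
    """frames: (T, C, H, W) decoded video frames."""
    T, C, H, W = frames.shape
    rows, cols = grid
    # Evenly pick rows*cols frames, downscale them, and tile into one image.
    idx = torch.linspace(0, T - 1, rows * cols).long()
    small = F.interpolate(frames[idx], size=(H // rows, W // cols))
    thumbnail = torch.cat(
        [torch.cat(list(small[r * cols:(r + 1) * cols]), dim=-1) for r in range(rows)],
        dim=-2,
    )
    # Keep a few full-resolution frames for fine-grained detail.
    sampled = frames[torch.linspace(0, T - 1, num_sampled).long()]
    return thumbnail, sampled  # both views go through the image encoder

thumb, sampled = thumbnail_and_sample(torch.rand(16, 3, 336, 336))
print(thumb.shape, sampled.shape)  # torch.Size([3, 336, 336]) torch.Size([2, 3, 336, 336])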
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI.