MVBench

9 papers with code • 0 benchmarks • 0 datasets

MVBench is a multi-modal video understanding benchmark that evaluates multi-modal large language models (MLLMs) on 20 temporally demanding video tasks, posed as multiple-choice question answering, which cannot be reliably solved from a single frame.

Most implemented papers

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

opengvlab/ask-anything CVPR 2024

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
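Because MVBench poses its video tasks as multiple-choice questions, evaluation reduces to checking whether the model picks the correct option for each clip. The sketch below illustrates that protocol under assumptions: the prompt template and the `model.generate_answer` interface are hypothetical, not the benchmark's official evaluation code.

```python
# Minimal sketch of MVBench-style multiple-choice evaluation.
# The prompt format and `model.generate_answer` are assumptions for illustration.

def format_prompt(question, options):
    """Render a video QA item as a multiple-choice prompt."""
    letters = "ABCD"
    lines = [question] + [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines) + "\nAnswer with the option's letter."

def evaluate(model, samples):
    """Accuracy over dicts with 'video', 'question', 'options', 'answer_idx'."""
    correct = 0
    for s in samples:
        prompt = format_prompt(s["question"], s["options"])
        pred = model.generate_answer(s["video"], prompt).strip()  # e.g. "(B)" or "B"
        pred_letter = pred.strip("() ")[:1].upper()
        if pred_letter == "ABCD"[s["answer_idx"]]:
            correct += 1
    return correct / len(samples)
```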

CogVLM2: Visual Language Models for Image and Video Understanding

thudm/glm-4 29 Aug 2024

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications.

ST-LLM: Large Language Models Are Effective Temporal Learners

TencentARC/ST-LLM 30 Mar 2024

In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs?
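The core idea is to hand the LLM the full spatio-temporal token sequence rather than a compressed summary. A minimal sketch, assuming a frame encoder that yields N tokens per frame and hypothetical dimensions (not ST-LLM's actual configuration):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; not ST-LLM's actual configuration.
T, N, D_VIS, D_LLM = 16, 256, 1024, 4096

class SpatioTemporalProjector(nn.Module):
    """Project every spatial-temporal visual token into the LLM embedding space."""
    def __init__(self, d_vis=D_VIS, d_llm=D_LLM):
        super().__init__()
        self.proj = nn.Linear(d_vis, d_llm)

    def forward(self, frame_tokens):          # [B, T, N, d_vis] from a frame encoder
        b, t, n, d = frame_tokens.shape
        tokens = self.proj(frame_tokens)      # [B, T, N, d_llm]
        return tokens.reshape(b, t * n, -1)   # flatten: all T*N tokens go to the LLM

# Usage: prepend the flattened visual sequence to the text embeddings,
# delegating video sequence modeling to the LLM itself.
visual = torch.randn(1, T, N, D_VIS)
text_emb = torch.randn(1, 32, D_LLM)
llm_input = torch.cat([SpatioTemporalProjector()(visual), text_emb], dim=1)
```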

PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

magic-research/PLLaVA arXiv 2024

PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks.

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

mbzuai-oryx/videogpt-plus 13 Jun 2024

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding.
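As the title suggests, the model pairs a per-frame image encoder (fine spatial detail) with a video encoder (temporal context). A rough sketch of such a dual-encoder fusion follows; module names and dimensions are assumptions, not the VideoGPT+ implementation.

```python
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Combine per-frame image-encoder tokens with video-encoder tokens by
    projecting both into a shared space and concatenating the sequences.
    A simplified sketch; dimensions are placeholders."""
    def __init__(self, d_img=1024, d_vid=768, d_llm=4096):
        super().__init__()
        self.img_proj = nn.Linear(d_img, d_llm)
        self.vid_proj = nn.Linear(d_vid, d_llm)

    def forward(self, img_tokens, vid_tokens):
        # img_tokens: [B, T*N_img, d_img]  (spatial detail per frame)
        # vid_tokens: [B, N_vid, d_vid]    (temporal context of the clip)
        return torch.cat([self.img_proj(img_tokens),
                          self.vid_proj(vid_tokens)], dim=1)
```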

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

qq-mm/video-ccam 26 Aug 2024

To address the limitations of existing approaches, we apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM).
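In this design, a fixed set of learnable queries cross-attends to a variable number of frame features, so videos of any length map to the same number of visual tokens for the LLM. Below is a simplified sketch; the masking scheme that lets earlier queries summarize earlier frames is one plausible reading of the causal cross-attention mask, and all dimensions are assumptions rather than Video-CCAM's code.

```python
import torch
import torch.nn as nn

class CrossAttentionProjector(nn.Module):
    """Learnable queries cross-attend to variable-length frame features,
    producing a fixed-size visual token sequence for the LLM."""
    def __init__(self, num_queries=96, d_model=1024, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frame_feats):                    # [B, T*N, d_model]
        b, l, _ = frame_feats.shape
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        nq = q.shape[1]
        # Causal-style mask (assumption): query i may only attend to the first
        # ceil((i+1)/nq * L) visual tokens, so earlier queries summarize
        # earlier portions of the video.
        limits = torch.ceil(torch.arange(1, nq + 1) / nq * l).long()
        mask = torch.arange(l).unsqueeze(0) >= limits.unsqueeze(1)  # True = blocked
        out, _ = self.attn(q, frame_feats, frame_feats, attn_mask=mask)
        return out                                     # [B, num_queries, d_model]
```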

Enhancing Temporal Modeling of Video LLMs via Time Gating

lavi-lab/tg-vid 8 Oct 2024

However, most existing Video LLMs neglect temporal information in video data, leading them to struggle with temporally aware video understanding.
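Time gating adds a temporal-mixing branch whose contribution is controlled by a learned gate, so the model can decide how much cross-frame information to blend in. A minimal sketch of that idea; layer names, dimensions, and the zero-initialized gate are assumptions rather than the TG-Vid implementation.

```python
import torch
import torch.nn as nn

class TimeGatedTemporalAttention(nn.Module):
    """Temporal self-attention across frames whose output is scaled by a
    learned gate before being added back to the input tokens."""
    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(d_model))  # starts closed: no temporal mixing

    def forward(self, x):                 # [B, T, N, d_model] per-frame tokens
        b, t, n, d = x.shape
        seq = x.permute(0, 2, 1, 3).reshape(b * n, t, d)   # attend along the time axis
        h = self.norm(seq)
        out, _ = self.attn(h, h, h)
        out = out.reshape(b, n, t, d).permute(0, 2, 1, 3)
        return x + torch.tanh(self.gate) * out             # gated residual update
```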

TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

tingyu215/ts-llava 17 Nov 2024

For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
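Because the approach is training-free, it only needs to construct a good visual token sequence at inference time: a thumbnail that tiles a few frames into one image for a global view, plus tokens uniformly sampled from the individual frames. The sketch below follows that recipe under assumptions; the `encode` callable, grid size, and sampling stride are hypothetical, not TS-LLaVA's exact procedure.

```python
import torch

def thumbnail_and_sample(frames, encode, grid=(2, 3), sample_stride=8):
    """Training-free visual token construction in the thumbnail-and-sampling spirit.
    frames: [T, 3, H, W] video frames; encode: images -> [B, N, D] token features."""
    t, c, h, w = frames.shape
    rows, cols = grid
    # Tile a few uniformly spaced frames into one thumbnail image.
    idx = torch.linspace(0, t - 1, rows * cols).long()
    tiles = frames[idx].reshape(rows, cols, c, h, w)
    thumbnail = tiles.permute(2, 0, 3, 1, 4).reshape(c, rows * h, cols * w)
    thumb_tokens = encode(thumbnail.unsqueeze(0))[0]        # [N, D] global view
    # Uniformly subsample tokens from the per-frame features.
    frame_tokens = encode(frames).flatten(0, 1)             # [T*N, D]
    sampled = frame_tokens[::sample_stride]
    return torch.cat([thumb_tokens, sampled], dim=0)        # tokens fed to the LLM
```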

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

dvlab-research/Lyra 12 Dec 2024

As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI.