95 papers with code • 0 benchmarks • 25 datasets
A crucial task of Video Understanding is to recognise and localise (in space and time) the different actions or events appearing in a video.
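Concretely, localisation in space and time means each detection carries an action label, a temporal extent, and per-frame spatial boxes. A minimal sketch of such a record, with hypothetical names chosen purely for illustration:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical container for one spatio-temporal action detection:
# an action label, its temporal extent (seconds), and a spatial
# bounding box (x1, y1, x2, y2) for each frame index it covers.
@dataclass
class ActionTube:
    label: str
    t_start: float
    t_end: float
    boxes: Dict[int, Tuple[float, float, float, float]] = field(default_factory=dict)

detection = ActionTube(
    label="diving",
    t_start=3.2,
    t_end=5.8,
    boxes={96: (120.0, 40.0, 310.0, 400.0), 97: (118.0, 42.0, 308.0, 398.0)},
)
print(detection.label, round(detection.t_end - detection.t_start, 1), "s")
```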
With rapidly evolving internet technologies and emerging tools, sports-related videos generated online are increasing at an unprecedented pace.
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.
Ranked #1 on Action Classification on Kinetics-400 (using extra training data)
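To make "pure Transformer" concrete: such models flatten a clip into a token sequence, typically by projecting small space-time patches ("tubelets") with a 3D convolution, and then run a standard Transformer encoder over the tokens. A minimal PyTorch sketch under assumed sizes; all names and dimensions here are illustrative, not any specific paper's configuration:

```python
import torch
import torch.nn as nn

# Toy tubelet embedding: a 3D conv whose stride equals its kernel
# turns a clip of shape (B, 3, T, H, W) into non-overlapping
# space-time patches, each projected to an embed_dim-sized token.
class TubeletEmbed(nn.Module):
    def __init__(self, embed_dim=96, tubelet=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(3, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                     # (B, 3, T, H, W)
        tokens = self.proj(video)                 # (B, D, T', H', W')
        return tokens.flatten(2).transpose(1, 2)  # (B, N, D)

clip = torch.randn(1, 3, 16, 224, 224)            # one 16-frame RGB clip
tokens = TubeletEmbed()(clip)                     # (1, 8 * 14 * 14, 96)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=96, nhead=8, batch_first=True),
    num_layers=2,
)
features = encoder(tokens)                        # contextualised video tokens
print(features.shape)                             # torch.Size([1, 1568, 96])
```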
Unlike language, where text tokens are largely independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar); hence, uniformly masking individual tokens makes the pretext task too trivial for the model to learn useful representations.
Ranked #1 on Action Recognition on Diving-48
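The correlation argument above is why masked video pretraining typically masks at a very high ratio and masks the same spatial positions in every frame ("tube" masking), so a hidden patch cannot be recovered by copying the nearly identical neighbouring frame. A toy NumPy sketch of the idea; the ratio and grid size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def tube_mask(num_frames, num_patches, mask_ratio=0.9):
    """Mask the same spatial patch positions in every frame.

    Returns a boolean (num_frames, num_patches) array where True
    marks a masked token. Sharing the mask across time prevents
    trivial reconstruction from the same patch in adjacent frames.
    """
    num_masked = int(mask_ratio * num_patches)
    masked_positions = rng.choice(num_patches, size=num_masked, replace=False)
    frame_mask = np.zeros(num_patches, dtype=bool)
    frame_mask[masked_positions] = True
    return np.tile(frame_mask, (num_frames, 1))

mask = tube_mask(num_frames=8, num_patches=196, mask_ratio=0.9)
print(mask.shape, round(mask.mean(), 3))  # (8, 196) 0.898
```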
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark to advance video understanding from describing to explaining temporal actions.
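For orientation, NExT-QA poses multiple-choice questions whose types go beyond description, e.g. causal "why" and temporal "what … after" questions. A hypothetical record layout; the field names are assumptions for exposition, not the dataset's actual schema:

```python
# Illustrative multiple-choice VideoQA sample; field names are
# assumed for exposition, not NExT-QA's actual file format.
sample = {
    "video_id": "clip_00421",
    "question_type": "causal",  # e.g. causal / temporal / descriptive
    "question": "Why did the boy pick up the ball?",
    "choices": [
        "to throw it to the dog",
        "to put it in the bag",
        "to show it to the camera",
        "to hide it behind his back",
        "to bounce it on the floor",
    ],
    "answer_index": 0,
}
print(sample["choices"][sample["answer_index"]])
```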
Temporal action detection (TAD) aims to determine the semantic label and the temporal boundaries of every action instance in an untrimmed video.
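Under this definition a prediction is a (label, start, end) tuple, and it is matched against ground truth by temporal IoU, the intersection over union of the two intervals. A minimal sketch:

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two temporal segments given as (start, end) in seconds."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0

# A predicted "high jump" instance vs. the annotated one.
print(round(temporal_iou((12.0, 18.5), (13.0, 19.0)), 3))  # 0.786
```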
Detecting generic, taxonomy-free event boundaries in videos represents a major stride forward towards holistic video understanding.
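Boundary detectors are commonly scored by matching predicted timestamps to ground-truth boundaries within a tolerance window and reporting F1. A hedged sketch of that matching; the greedy strategy and the absolute 0.25 s tolerance are simplifications for illustration:

```python
def boundary_f1(predicted, ground_truth, tolerance=0.25):
    """Greedy one-to-one matching of boundary timestamps (seconds).

    A prediction is a hit if it lies within `tolerance` seconds of a
    still-unmatched ground-truth boundary. Returns the F1 score.
    """
    unmatched = list(ground_truth)
    hits = 0
    for p in sorted(predicted):
        match = next((g for g in unmatched if abs(p - g) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return 2 * precision * recall / (precision + recall) if hits else 0.0

print(round(boundary_f1([1.1, 4.0, 7.6], [1.0, 4.2, 9.0]), 3))  # 0.667
```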
However, due to the limited amount of labeled data commonly available for training automatic sign language recognition, the VTN cannot reach its full potential in this domain.
Ranked #3 on Sign Language Recognition on AUTSL
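A common remedy for such data scarcity is transfer learning: start from a model pretrained on a large generic video corpus and fine-tune only a lightweight classification head. A generic PyTorch sketch of that recipe; the ResNet3D backbone and the 226-class head are placeholders for illustration, not the VTN paper's exact setup:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# Placeholder backbone pretrained on Kinetics-400; frozen so only the
# new head trains on the small sign-language dataset.
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 226)  # e.g. 226 AUTSL sign classes

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

clips = torch.randn(4, 3, 16, 112, 112)   # (B, C, T, H, W) dummy batch
labels = torch.randint(0, 226, (4,))
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
print(loss.item())
```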