Video Understanding
293 papers with code • 0 benchmarks • 42 datasets
A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.
Benchmarks
These leaderboards are used to track progress in Video Understanding
Libraries
Use these libraries to find Video Understanding models and implementations
Datasets
Subtasks
Latest papers
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Language Repository for Long Video Understanding
In this paper, we introduce a Language Repository (LangRepo) for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation.
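As a loose illustration only (not the paper's implementation), the sketch below shows what an all-textual, structured store of per-segment video descriptions could look like; every class, field, and example string in it is hypothetical.

```python
# Toy illustration of an all-textual, structured repository of per-segment
# video descriptions. This is NOT LangRepo's actual code; all names are made up.
from dataclasses import dataclass, field

@dataclass
class Entry:
    start_s: float
    end_s: float
    text: str  # short natural-language description of the segment

@dataclass
class LanguageRepository:
    entries: list[Entry] = field(default_factory=list)

    def write(self, start_s: float, end_s: float, text: str) -> None:
        self.entries.append(Entry(start_s, end_s, text))

    def read(self) -> str:
        # Everything stays textual, so the context handed to the LLM is
        # directly interpretable by a human as well.
        return "\n".join(f"[{e.start_s:.0f}-{e.end_s:.0f}s] {e.text}" for e in self.entries)

repo = LanguageRepository()
repo.write(0, 10, "a person opens a fridge and takes out milk")
repo.write(10, 25, "they pour the milk into a bowl of cereal")
print(repo.read())
```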
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
We hypothesize that the latent representation learned from a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding.
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.
Don't Judge by the Look: Towards Motion Coherent Video Representation
Current training pipelines in object recognition neglect Hue Jittering during data augmentation: it introduces appearance changes that are detrimental to classification, and its implementation is inefficient in practice.
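For context, hue jittering is ordinarily applied with a standard colour transform; the snippet below is a minimal sketch using torchvision's ColorJitter, with illustrative parameter values and a dummy clip rather than anything from the paper.

```python
# Minimal sketch of standard hue jittering with torchvision; values are illustrative.
import torch
from torchvision import transforms

# ColorJitter's hue argument shifts the hue channel by a random factor in [-hue, hue].
# Internally this involves an RGB -> HSV -> RGB round trip, which is one reason the
# operation is considered costly on large video batches.
hue_jitter = transforms.ColorJitter(hue=0.25)

frames = torch.rand(8, 3, 224, 224)  # dummy clip: 8 RGB frames with values in [0, 1]
jittered = hue_jitter(frames)        # one sampled hue shift applied to the whole clip
```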
VideoMamba: State Space Model for Efficient Video Understanding
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work adapts the Mamba architecture to the video domain.
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones.
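As a rough illustration of the underlying idea (not FastV's exact procedure), the sketch below ranks visual tokens by the attention they receive in an early layer and keeps only the top fraction for the remaining layers; all function names, shapes, and values are hypothetical.

```python
# Hypothetical sketch of attention-based visual token pruning; not FastV's algorithm.
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_end, keep_ratio=0.5):
    """hidden: (seq, dim) early-layer output; attn: (seq, seq) averaged attention map.
    Keeps the keep_ratio most-attended visual tokens in [vis_start, vis_end)."""
    vis_scores = attn[:, vis_start:vis_end].mean(dim=0)          # attention each visual token receives
    k = max(1, int(keep_ratio * (vis_end - vis_start)))
    keep = vis_scores.topk(k).indices.sort().values + vis_start  # retained visual token positions
    kept_idx = torch.cat([
        torch.arange(0, vis_start),             # prompt tokens before the image
        keep,                                   # surviving visual tokens
        torch.arange(vis_end, hidden.size(0)),  # tokens after the image
    ])
    return hidden[kept_idx], kept_idx

# Example: 576 visual tokens preceded by 32 prompt tokens, half pruned after an early layer.
seq, dim = 32 + 576 + 64, 4096
hidden = torch.randn(seq, dim)
attn = torch.softmax(torch.randn(seq, seq), dim=-1)
pruned, idx = prune_visual_tokens(hidden, attn, vis_start=32, vis_end=32 + 576)
```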
World Model on Million-Length Video And Language With Blockwise RingAttention
To address these challenges, we curate a large dataset of diverse videos and books, utilize the Blockwise RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens.
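The gradual context extension can be pictured as a staged schedule; the sketch below is a hypothetical curriculum with made-up stage lengths and a placeholder training call, not the paper's actual configuration.

```python
# Hypothetical context-length curriculum from 4K to 1M tokens; stage sizes are illustrative.
context_schedule = [
    (4_096, 10_000),     # start at a 4K-token context
    (32_768, 5_000),
    (262_144, 2_000),
    (1_048_576, 1_000),  # finish at a 1M-token context
]

for context_len, num_steps in context_schedule:
    # Re-pack the data into sequences of the current length, then train for num_steps.
    # train_steps(model, data, context_len, num_steps)  # hypothetical helper
    print(f"training {num_steps} steps at context length {context_len}")
```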
Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning
High-quality and consistent annotations are fundamental to the successful development of robust machine learning models.
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos.