Video Understanding
300 papers with code • 0 benchmarks • 42 datasets
A crucial task in Video Understanding is to recognise and localise (in space and time) the different actions or events appearing in a video.
Benchmarks
These leaderboards are used to track progress in Video Understanding.
Libraries
Use these libraries to find Video Understanding models and implementations
Datasets
Subtasks
Most implemented papers
A Multigrid Method for Efficiently Training Video Models
We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
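For intuition, here is a minimal sketch of a multigrid-style schedule that trades batch size against spatial and temporal resolution so that per-iteration compute stays roughly constant. The scale values and cycle length are hypothetical illustrations, not the paper's actual long+short cycles:

```python
# Minimal sketch of a multigrid-style training schedule (hypothetical values).
# Idea: vary (batch, frames, crop size) during training while keeping
# batch * frames * height * width roughly constant.

BASE_BATCH, BASE_FRAMES, BASE_SIZE = 8, 32, 224

# Coarse-to-fine grid cycle: (temporal scale, spatial scale) pairs.
GRID_CYCLE = [(0.25, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

def grid_for_epoch(epoch: int, epochs_per_step: int = 2):
    """Return (batch, frames, size) for the current point in the cycle."""
    t_scale, s_scale = GRID_CYCLE[(epoch // epochs_per_step) % len(GRID_CYCLE)]
    frames = max(1, int(BASE_FRAMES * t_scale))
    size = int(BASE_SIZE * s_scale)
    # Enlarge the batch to keep per-iteration compute roughly constant.
    batch = int(BASE_BATCH * (BASE_FRAMES / frames) * (BASE_SIZE / size) ** 2)
    return batch, frames, size

for epoch in range(8):
    print(epoch, grid_for_epoch(epoch))
```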
Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection
In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera.
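The mechanism can be pictured as attention over a long-term, per-camera memory bank of features. The sketch below assumes plain scaled dot-product attention and omits the learned projections the paper uses:

```python
import torch
import torch.nn.functional as F

def attend_to_memory(box_feats, memory_bank, dim=256):
    """Hypothetical sketch of Context R-CNN-style context attention:
    per-box features from the current frame (N, dim) attend over a
    memory bank of features from unlabeled frames (M, dim) collected
    at the same camera."""
    scores = box_feats @ memory_bank.t() / dim ** 0.5   # (N, M)
    context = F.softmax(scores, dim=-1) @ memory_bank   # (N, dim)
    return box_feats + context  # fuse long-term context into box features

boxes = torch.randn(4, 256)     # detections in the current frame
memory = torch.randn(100, 256)  # features pooled from weeks of footage
print(attend_to_memory(boxes, memory).shape)  # torch.Size([4, 256])
```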
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
We propose to explicitly model the Actor-Context-Actor Relation, which is the relation between two actors based on their interactions with the context.
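As a loose illustration of the idea, the sketch below builds first-order actor-context relations and then relates actor pairs through their context responses, using simple element-wise products in place of the paper's learned relation operators:

```python
import torch

def actor_context_actor(actors, context):
    """Hypothetical sketch of a high-order actor-context-actor relation:
    actors (N, D) each interact with context features (K, D); actor
    pairs are then related through their context-conditioned features."""
    # First-order relations: one response per (actor, context) pair.
    first_order = actors.unsqueeze(1) * context.unsqueeze(0)    # (N, K, D)
    # Pool context responses per actor, then relate every actor pair.
    per_actor = first_order.mean(dim=1)                         # (N, D)
    relation = per_actor.unsqueeze(1) * per_actor.unsqueeze(0)  # (N, N, D)
    return relation

print(actor_context_actor(torch.randn(3, 64), torch.randn(49, 64)).shape)
```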
SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos
In this work, we propose SoccerNet-v2, a novel large-scale corpus of manual annotations for the SoccerNet video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production.
Token Shift Transformer for Video Classification
Notably, the TokShift transformer is an early, purely convolution-free video transformer that remains computationally efficient for video understanding.
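The shift itself is parameter-free. Below is a hedged sketch of a generic temporal token shift; note that the paper applies the shift specifically to the [Class] token's channels, whereas this toy version shifts a fraction of every token's channels:

```python
import torch

def token_shift(x, shift_frac=0.25):
    """Hypothetical sketch of a temporal token shift. x has shape
    (batch, time, tokens, channels); a fraction of channels is shifted
    to the previous/next frame, the rest stay in place. Zero parameters."""
    b, t, n, c = x.shape
    k = int(c * shift_frac) // 2
    out = x.clone()
    out[:, 1:, :, :k] = x[:, :-1, :, :k]            # shift forward in time
    out[:, :-1, :, k:2 * k] = x[:, 1:, :, k:2 * k]  # shift backward in time
    return out

x = torch.randn(2, 8, 197, 768)  # e.g. ViT tokens over 8 frames
print(token_shift(x).shape)      # torch.Size([2, 8, 197, 768])
```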
Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection
Generic event boundary detection is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries.
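As a toy illustration, a "difference map" can be built from pairwise distances between per-frame features, where boundaries show up as block structure. This sketch uses plain L2 distances and ignores the paper's multi-level maps and progressive attention:

```python
import torch

def dense_difference_map(feats):
    """Hypothetical sketch of a dense difference map: given per-frame
    features (T, D), compute all pairwise feature distances (T, T).
    Event boundaries tend to appear as block boundaries in this map."""
    return torch.cdist(feats, feats)  # (T, T) L2 distances

feats = torch.randn(16, 128)  # features for 16 frames
print(dense_difference_map(feats).shape)  # torch.Size([16, 16])
```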
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
UniFormer alleviates this issue by unifying convolution and self-attention as a single relation aggregator in the transformer format.
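A hedged sketch of that design: one aggregator slot per block, implemented as a cheap local aggregator (depthwise 3D convolution) in shallow stages and as global self-attention in deep stages. Module names and sizes here are illustrative, not the released code:

```python
import torch
import torch.nn as nn

class RelationAggregator(nn.Module):
    """Hypothetical sketch of a UniFormer-style relation aggregator slot:
    local (depthwise 3D conv) or global (self-attention) token mixing."""
    def __init__(self, dim, local=True, heads=8):
        super().__init__()
        self.local = local
        if local:
            self.agg = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        else:
            self.agg = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (batch, dim, T, H, W)
        if self.local:
            return self.agg(x)
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (b, T*H*W, dim)
        out, _ = self.agg(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, t, h, w)

x = torch.randn(1, 64, 4, 14, 14)
print(RelationAggregator(64, local=True)(x).shape)   # local mixing
print(RelationAggregator(64, local=False)(x).shape)  # global mixing
```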
Panoptic Video Scene Graph Generation
PVSG relates to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos.
VideoMamba: State Space Model for Efficient Video Understanding
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work adapts the Mamba state space model to the video domain.
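For intuition, the core of a state space model is a linear recurrence over the token sequence. The sketch below is a naive sequential scan with fixed matrices; real Mamba makes the parameters input-dependent and uses a hardware-aware parallel scan:

```python
import torch

def ssm_scan(x, A, B, C):
    """Hypothetical sketch of the discrete state-space recurrence behind
    Mamba-style models: h_t = A h_{t-1} + B x_t, y_t = C h_t, applied
    here over a flattened sequence of spatiotemporal tokens."""
    T, _ = x.shape
    h = torch.zeros(A.shape[0])
    ys = []
    for t in range(T):
        h = A @ h + B @ x[t]  # linear recurrence over the token sequence
        ys.append(C @ h)
    return torch.stack(ys)    # (T, d_out)

T, d_in, d_state, d_out = 32, 16, 8, 16
y = ssm_scan(torch.randn(T, d_in),
             torch.randn(d_state, d_state) * 0.1,  # state transition A
             torch.randn(d_state, d_in),           # input matrix B
             torch.randn(d_out, d_state))          # output matrix C
print(y.shape)  # torch.Size([32, 16])
```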
Constrained-size Tensorflow Models for YouTube-8M Video Understanding Challenge
This paper presents our 7th-place solution to the second YouTube-8M video understanding competition, which challenged participants to build a constrained-size model that classifies millions of YouTube videos into thousands of classes.