Video Understanding
293 papers with code • 0 benchmarks • 42 datasets
A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.
Most implemented papers
TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition
We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets, applied to spatiotemporal feature matrices, are able to exploit spatiotemporal dynamics to improve the overall performance.
VirtualHome: Simulating Household Activities via Programs
We then implement the most common atomic (inter)actions in the Unity3D game engine, and use our programs to "drive" an artificial agent to execute tasks in a simulated household environment.
Long-Term Feature Banks for Detailed Video Understanding
To understand the world, we humans constantly need to relate the present to the past, and put events in context.
Temporal Interlacing Network
In this way, a heavy temporal model is replaced by a simple interlacing operator.
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks.
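As a rough illustration of "a handful of adaptively learned tokens", the sketch below reduces a grid of position features to a few tokens: each token gets its own spatial attention map, and the token is the attention-weighted average of the features. This is a simplified sketch, not the paper's implementation. The linear scorer, the sigmoid, and the names `learn_tokens` and `scorer_weights` are assumptions made for illustration, and in a real model the scorer weights would be trained rather than fixed.

```python
import math
from typing import List

Features = List[List[float]]  # features[n][c]: N spatial positions, C channels


def learn_tokens(features: Features, scorer_weights: List[List[float]]) -> Features:
    """Reduce N position features to S tokens (S = len(scorer_weights)).

    scorer_weights[s] is a hypothetical per-token linear scorer over channels.
    """
    num_positions = len(features)
    num_channels = len(features[0])
    tokens = []
    for w in scorer_weights:
        # Spatial attention map: one weight in (0, 1) per position.
        attn = [
            1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for x in features
        ]
        # Weight every position by its attention, then average over positions.
        token = [
            sum(a * x[c] for a, x in zip(attn, features)) / num_positions
            for c in range(num_channels)
        ]
        tokens.append(token)
    return tokens


# Four positions with two channels each, reduced to 8 tokens.
example_features = [[2.0, 0.0], [4.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
example_weights = [[0.0, 0.0] for _ in range(8)]  # untrained placeholder scorers
example_tokens = learn_tokens(example_features, example_weights)
```

The output size depends only on the number of scorers, not on the number of input positions, so later layers operate on 8 tokens regardless of image or video resolution.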
TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Devices
Secondly, TSM has high efficiency; it achieves high frame rates of 74 fps and 29 fps for online video recognition on Jetson Nano and Galaxy Note8, respectively.
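The snippet above does not describe the shift operation itself. As a minimal sketch of the general idea behind a temporal shift, the code below moves one group of channels one step backward in time and another group one step forward, and leaves the rest untouched, so neighbouring frames exchange information without extra multiplications. The function name `temporal_shift`, the list-based layout, and the `fold_div=8` default are illustrative assumptions, not the authors' code.

```python
from typing import List

Clip = List[List[float]]  # clip[t][c]: T frames in temporal order, C channels


def temporal_shift(clip: Clip, fold_div: int = 8) -> Clip:
    """Shift a fraction of channels along time, zero-padding at the clip ends."""
    num_frames = len(clip)
    num_channels = len(clip[0])
    fold = num_channels // fold_div  # channels shifted in each direction
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # first group: read from the next frame
            elif c < 2 * fold:
                src = t - 1  # second group: read from the previous frame
            else:
                src = t      # remaining channels stay in place
            if 0 <= src < num_frames:
                out[t][c] = clip[src][c]
    return out


# Three frames with eight channels; the value encodes (frame, channel).
example_clip = [[float(10 * t + c) for c in range(8)] for t in range(3)]
example_shifted = temporal_shift(example_clip)
```

For online recognition, where future frames are not yet available, a variant would read only from previous frames (the `t - 1` branch) and cache those channels between frames.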
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.
Flamingo: a Visual Language Model for Few-Shot Learning
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
DeepSportradar-v1: Computer Vision Dataset for Sports Understanding with High Quality Annotations
With the recent development of Deep Learning applied to Computer Vision, sport video understanding has gained a lot of attention, providing much richer information for both sport consumers and leagues.
A Multigrid Method for Efficiently Training Video Models
We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
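To make the idea of a grid schedule more concrete, the sketch below shows one way a mini-batch size could be scaled as the sampling grid (frames × height × width) changes, so that the work per training iteration stays roughly constant. This is a simplified illustration under stated assumptions: the proportional scaling rule, the function name `scaled_batch_size`, and the specific shapes are hypothetical and are not the schedule reported in the paper.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (frames, height, width) of one training sample


def scaled_batch_size(base_batch: int, base_shape: Shape, shape: Shape) -> int:
    """Pick a batch size so batch * frames * height * width stays roughly constant."""
    base_cost = base_shape[0] * base_shape[1] * base_shape[2]
    cost = shape[0] * shape[1] * shape[2]
    return max(1, (base_batch * base_cost) // cost)


# Hypothetical coarse-to-fine schedule: coarse grids allow larger mini-batches.
BASE_BATCH = 8
BASE_SHAPE: Shape = (16, 224, 224)
example_schedule: List[Shape] = [(4, 112, 112), (8, 112, 112), (16, 224, 224)]
example_batches = [
    scaled_batch_size(BASE_BATCH, BASE_SHAPE, shape) for shape in example_schedule
]
```

Early iterations on coarse grids process many more samples per step, which is where a training speedup would come from; ending on the full-resolution grid is the part of the design meant to protect final accuracy.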