Video Understanding
234 papers with code • 0 benchmarks • 38 datasets
A crucial task in Video Understanding is to recognise and localise (in space and time) the different actions or events that appear in a video.
Benchmarks
These leaderboards are used to track progress in Video Understanding
Libraries
Use these libraries to find Video Understanding models and implementations
Datasets
Most implemented papers
Video Swin Transformer
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.
TSM: Temporal Shift Module for Efficient Video Understanding
The explosive growth in video streaming gives rise to challenges in performing video understanding at high accuracy and low computation cost.
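A minimal sketch of the temporal shift idea behind TSM, as a hypothetical helper rather than the authors' implementation: a fraction of channels is shifted forward or backward along the time axis so that an ordinary 2D convolution afterwards can mix information across neighbouring frames at near-zero extra cost. Shapes and the fold ratio below are illustrative assumptions.

```python
import torch

def temporal_shift(x: torch.Tensor, n_segments: int, fold_div: int = 8) -> torch.Tensor:
    """Shift part of the channels along time. x: (N*T, C, H, W) for N clips of T frames."""
    nt, c, h, w = x.shape
    n = nt // n_segments
    x = x.view(n, n_segments, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # one channel group shifted backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # another group shifted forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels left untouched
    return out.view(nt, c, h, w)

# Example: 2 clips of 8 frames with 64-channel 56x56 feature maps.
feats = torch.randn(2 * 8, 64, 56, 56)
shifted = temporal_shift(feats, n_segments=8)
```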
Is Space-Time Attention All You Need for Video Understanding?
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
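A minimal sketch of self-attention applied jointly over space and time: patch embeddings from all frames are flattened into one token sequence so every patch can attend to every other patch in every frame. The paper's preferred variant factorizes this into separate temporal and spatial attention steps; the shapes and layer sizes here are illustrative assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

B, T, P, D = 2, 8, 196, 768          # batch, frames, patches per frame, embedding dim
tokens = torch.randn(B, T * P, D)    # patch embeddings flattened over space and time

attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
out, _ = attn(tokens, tokens, tokens)   # every patch attends to all patches in all frames
```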
SoccerNet 2022 Challenges Results
The SoccerNet 2022 challenges were the second annual video understanding challenges organized by the SoccerNet team.
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.
Video Instance Segmentation
The goal of this new task is simultaneous detection, segmentation and tracking of instances in videos.
Learnable pooling with Context Gating for video classification
In particular, we evaluate our method on the large-scale multi-modal Youtube-8M v2 dataset and outperform all other methods in the Youtube 8M Large-Scale Video Understanding challenge.
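A minimal sketch of the Context Gating mechanism described in the paper: a pooled video descriptor is re-weighted by a learned sigmoid gate, y = sigmoid(Wx + b) * x. The feature dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextGating(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(x)) * x   # elementwise gating of the input features

gate = ContextGating(1024)
pooled = torch.randn(4, 1024)   # e.g. pooled clip descriptors
gated = gate(pooled)
```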
Representation Flow for Action Recognition
Our representation flow layer is a fully-differentiable layer designed to capture the 'flow' of any representation channel within a convolutional neural network for action recognition.
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
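A minimal sketch of the simplest (parameter-free mean-pooling) way to match a video to a caption with CLIP features, in the spirit of CLIP4Clip: frame embeddings from a CLIP image encoder are averaged into one video embedding and ranked against the CLIP text embedding by cosine similarity. The encoders are assumed to be available elsewhere; only precomputed embeddings are used here.

```python
import torch
import torch.nn.functional as F

frame_emb = torch.randn(12, 512)   # 12 sampled frames, CLIP image features (illustrative)
text_emb = torch.randn(1, 512)     # one caption, CLIP text feature (illustrative)

video_emb = F.normalize(frame_emb.mean(dim=0, keepdim=True), dim=-1)
text_emb = F.normalize(text_emb, dim=-1)
similarity = video_emb @ text_emb.t()   # cosine similarity used to rank retrieval candidates
```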
TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition
We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets, applied to spatiotemporal feature matrices, are able to exploit spatiotemporal dynamics to improve the overall performance.
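A minimal sketch of temporal modeling with an LSTM over per-frame CNN features, in the spirit of the TS-LSTM idea; the layer sizes, number of classes, and feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

B, T, D = 4, 16, 2048                    # clips, frames per clip, per-frame feature dim
frame_feats = torch.randn(B, T, D)       # e.g. CNN features extracted per frame

lstm = nn.LSTM(input_size=D, hidden_size=512, batch_first=True)
classifier = nn.Linear(512, 101)         # e.g. 101 action classes

out, _ = lstm(frame_feats)               # (B, T, 512) hidden states over time
logits = classifier(out[:, -1])          # classify from the last time step
```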