95 papers with code • 0 benchmarks • 25 datasets
A crucial task of Video Understanding is to recognise and localise (in space and time) the different actions or events appearing in a video.
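Concretely, localisation in space and time means each detection carries an action label, a temporal extent, and per-frame spatial boxes. A minimal sketch of such a record, with hypothetical names chosen purely for illustration:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical container for one spatio-temporal action detection:
# an action label, its temporal extent (seconds), and a spatial
# bounding box (x1, y1, x2, y2) for each frame index it covers.
@dataclass
class ActionTube:
    label: str
    t_start: float
    t_end: float
    boxes: Dict[int, Tuple[float, float, float, float]] = field(default_factory=dict)

detection = ActionTube(
    label="diving",
    t_start=3.2,
    t_end=5.8,
    boxes={96: (120.0, 40.0, 310.0, 400.0), 97: (118.0, 42.0, 308.0, 398.0)},
)
print(detection.label, round(detection.t_end - detection.t_start, 1), "s")
```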
With rapidly evolving internet technologies and emerging tools, sports-related videos generated online are increasing at an unprecedented pace.
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.
Ranked #1 on Action Classification on Kinetics-400 (using extra training data)
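To make "pure Transformer" concrete: such models flatten a clip into a token sequence, typically by projecting small space-time patches ("tubelets") with a 3D convolution, and then run a standard Transformer encoder over the tokens. A minimal PyTorch sketch under assumed sizes; all names and dimensions here are illustrative, not any specific paper's configuration:

```python
import torch
import torch.nn as nn

# Toy tubelet embedding: a 3D conv whose stride equals its kernel
# turns a clip of shape (B, 3, T, H, W) into non-overlapping
# space-time patches, each projected to an embed_dim-sized token.
class TubeletEmbed(nn.Module):
    def __init__(self, embed_dim=96, tubelet=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(3, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                     # (B, 3, T, H, W)
        tokens = self.proj(video)                 # (B, D, T', H', W')
        return tokens.flatten(2).transpose(1, 2)  # (B, N, D)

clip = torch.randn(1, 3, 16, 224, 224)            # one 16-frame RGB clip
tokens = TubeletEmbed()(clip)                     # (1, 8 * 14 * 14, 96)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=96, nhead=8, batch_first=True),
    num_layers=2,
)
features = encoder(tokens)                        # contextualised video tokens
print(features.shape)                             # torch.Size([1, 1568, 96])
```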
Unlike language, where text tokens are largely independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar); hence, uniformly masking individual tokens makes the pretext task too trivial for the model to learn useful representations.
Ranked #1 on Action Recognition on Diving-48
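The correlation argument above is why masked video pretraining typically masks at a very high ratio and masks the same spatial positions in every frame ("tube" masking), so a hidden patch cannot be recovered by copying the nearly identical neighbouring frame. A toy NumPy sketch of the idea; the ratio and grid size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def tube_mask(num_frames, num_patches, mask_ratio=0.9):
    """Mask the same spatial patch positions in every frame.

    Returns a boolean (num_frames, num_patches) array where True
    marks a masked token. Sharing the mask across time prevents
    trivial reconstruction from the same patch in adjacent frames.
    """
    num_masked = int(mask_ratio * num_patches)
    masked_positions = rng.choice(num_patches, size=num_masked, replace=False)
    frame_mask = np.zeros(num_patches, dtype=bool)
    frame_mask[masked_positions] = True
    return np.tile(frame_mask, (num_frames, 1))

mask = tube_mask(num_frames=8, num_patches=196, mask_ratio=0.9)
print(mask.shape, round(mask.mean(), 3))  # (8, 196) 0.898
```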
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark to advance video understanding from describing to explaining temporal actions.
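For orientation, NExT-QA poses multiple-choice questions whose types go beyond description, e.g. causal "why" and temporal "what … after" questions. A hypothetical record layout; the field names are assumptions for exposition, not the dataset's actual schema:

```python
# Illustrative multiple-choice VideoQA sample; field names are
# assumed for exposition, not NExT-QA's actual file format.
sample = {
    "video_id": "clip_00421",
    "question_type": "causal",  # e.g. causal / temporal / descriptive
    "question": "Why did the boy pick up the ball?",
    "choices": [
        "to throw it to the dog",
        "to put it in the bag",
        "to show it to the camera",
        "to hide it behind his back",
        "to bounce it on the floor",
    ],
    "answer_index": 0,
}
print(sample["choices"][sample["answer_index"]])
```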
Temporal action detection (TAD) aims to determine the semantic label and the temporal boundaries of every action instance in an untrimmed video.
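Under this definition a prediction is a (label, start, end) tuple, and it is matched against ground truth by temporal IoU, the intersection over union of the two intervals. A minimal sketch:

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two temporal segments given as (start, end) in seconds."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0

# A predicted "high jump" instance vs. the annotated one.
print(round(temporal_iou((12.0, 18.5), (13.0, 19.0)), 3))  # 0.786
```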
Detecting generic, taxonomy-free event boundaries in videos represents a major stride forward towards holistic video understanding.
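Boundary detectors are commonly scored by matching predicted timestamps to ground-truth boundaries within a tolerance window and reporting F1. A hedged sketch of that matching; the greedy strategy and the absolute 0.25 s tolerance are simplifications for illustration:

```python
def boundary_f1(predicted, ground_truth, tolerance=0.25):
    """Greedy one-to-one matching of boundary timestamps (seconds).

    A prediction is a hit if it lies within `tolerance` seconds of a
    still-unmatched ground-truth boundary. Returns the F1 score.
    """
    unmatched = list(ground_truth)
    hits = 0
    for p in sorted(predicted):
        match = next((g for g in unmatched if abs(p - g) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return 2 * precision * recall / (precision + recall) if hits else 0.0

print(round(boundary_f1([1.1, 4.0, 7.6], [1.0, 4.2, 9.0]), 3))  # 0.667
```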
However, due to the limited amount of labeled data commonly available for training automatic sign language recognition, the VTN cannot reach its full potential in this domain.
Ranked #3 on Sign Language Recognition on AUTSL
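A common remedy for such data scarcity is transfer learning: start from a model pretrained on a large generic video corpus and fine-tune only a lightweight classification head. A generic PyTorch sketch of that recipe; the ResNet3D backbone and the 226-class head are placeholders for illustration, not the VTN paper's exact setup:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# Placeholder backbone pretrained on Kinetics-400; frozen so only the
# new head trains on the small sign-language dataset.
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 226)  # e.g. 226 AUTSL sign classes

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

clips = torch.randn(4, 3, 16, 112, 112)   # (B, C, T, H, W) dummy batch
labels = torch.randint(0, 226, (4,))
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
print(loss.item())
```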