A crucial task in Video Understanding is to recognise and localise (in space and time) the different actions or events that appear in a video.
In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera.
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently (see the annotation sketch below).
Ranked #2 on Temporal Action Localization on J-HMDB-21
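To make the annotation scheme above concrete, here is a minimal sketch of one AVA-style record: a person box at a keyframe carrying several concurrent atomic-action labels. The field names and schema are illustrative assumptions, not the official AVA file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AvaStyleLabel:
    """One AVA-style annotation: a person box at a keyframe.

    Schema is illustrative, not the official AVA CSV format.
    """
    video_id: str                           # clip identifier
    timestamp: float                        # keyframe time within the clip, in seconds
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]
    action_ids: List[int] = field(default_factory=list)  # concurrent atomic actions

# The same person often carries multiple action labels at one keyframe:
label = AvaStyleLabel("clip_001", 902.0, (0.1, 0.2, 0.4, 0.9), [11, 17, 80])
```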
Learning to represent videos is a very challenging task both algorithmically and computationally.
Tasks: Action Classification, Action Recognition, Multimodal Activity Recognition, Optical Flow Estimation, Video Understanding
We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU); a toy grid schedule is sketched below.
Ranked #1 on Video Classification on Kinetics
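A minimal sketch of a multigrid-style schedule, under the assumption that a "grid" is a (batch size, clip length, spatial size) shape whose product stays roughly constant. The specific grid values and the epoch-to-grid mapping are illustrative, not the paper's exact schedule.

```python
import torch.nn.functional as F

# Illustrative grids: coarser clip shapes permit larger batches at roughly
# constant per-iteration cost (batch * frames * height * width ~ const).
GRIDS = [
    {"batch": 64, "frames": 4,  "size": 112},  # coarse: early training
    {"batch": 16, "frames": 8,  "size": 160},
    {"batch": 8,  "frames": 16, "size": 224},  # fine: the target shape
]

def grid_for_epoch(epoch: int, num_epochs: int) -> dict:
    """Spend early epochs on coarse grids and finish on the fine one."""
    idx = min(len(GRIDS) - 1, epoch * len(GRIDS) // num_epochs)
    return GRIDS[idx]

def resample_clips(clips, frames: int, size: int):
    """Resize a batch of clips (B, C, T, H, W) to the current grid shape."""
    return F.interpolate(clips, size=(frames, size, size),
                         mode="trilinear", align_corners=False)
```

Note that changing the batch size per grid would in practice mean rebuilding the data loader; the resampling above only handles the clip shape.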
The explosive growth in video streaming gives rise to challenges in performing video understanding at high accuracy and low computational cost.
Ranked #4 on Action Recognition on Something-Something V2 (using extra training data)
Tasks: Action Classification, Action Recognition, Video Object Detection, Video Recognition, Video Understanding
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video.
Ranked #5 on Pose Tracking on PoseTrack2017 (using extra training data)
Tasks: Human Detection, Multi-Object Tracking, Pose Estimation, Pose Tracking, Video Understanding
In this way, a heavy temporal model is replaced by a simple interlacing operator.
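As a rough illustration of what a parameter-free interlacing operator can look like, here is a fixed temporal-shift sketch in which a fraction of channels is displaced one step forward or backward in time. The paper's operator may differ (for instance by learning the offsets); `fold_div` and the shift pattern are assumptions.

```python
import torch

def temporal_interlace(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """Fixed-shift interlacing sketch over a clip tensor x of shape (B, T, C, H, W).

    A 1/fold_div slice of channels moves one step forward in time, another
    slice moves one step backward, and the rest stay in place; frames thus
    exchange information with zero extra parameters.
    """
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels untouched
    return out
```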
To understand the world, we humans constantly need to relate the present to the past, and put events in context.
Ranked #3 on Egocentric Activity Recognition on EPIC-KITCHENS-55
Tasks: Action Classification, Action Recognition, Egocentric Activity Recognition, Video Understanding
We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets operating on spatiotemporal feature matrices can exploit spatiotemporal dynamics to improve overall performance (minimal versions of both heads are sketched below).
Ranked #33 on Action Recognition on UCF101
Tasks: Action Classification, Action Recognition, Video Understanding
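A minimal sketch of the two kinds of temporal heads the snippet describes, operating on a (batch, time, feature) matrix of per-frame CNN features. The dimensions, layer choices, and class count (101, as for UCF101) are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LSTMHead(nn.Module):
    """RNN head: run an LSTM over per-frame features, classify from the last step."""
    def __init__(self, feat_dim: int = 2048, hidden: int = 512, num_classes: int = 101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # feats: (B, T, D)
        out, _ = self.lstm(feats)                             # (B, T, hidden)
        return self.fc(out[:, -1])                            # (B, num_classes)

class TemporalConvHead(nn.Module):
    """Temporal-ConvNet head: 1-D convolutions over time, then pool and classify."""
    def __init__(self, feat_dim: int = 2048, hidden: int = 512, num_classes: int = 101):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # feats: (B, T, D)
        h = self.conv(feats.transpose(1, 2)).squeeze(-1)      # (B, hidden)
        return self.fc(h)
```

Both heads consume the same feature matrix, which is what makes the RNN-versus-temporal-convolution comparison in the snippet possible.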
In particular, we evaluate our method on the large-scale multi-modal YouTube-8M v2 dataset and outperform all other methods in the YouTube-8M Large-Scale Video Understanding Challenge.