The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video.
The explosive growth in video streaming poses challenges for performing video understanding at high accuracy and low computational cost.
We demonstrate that both RNNs (using LSTMs) and Temporal ConvNets operating on spatiotemporal feature matrices can exploit spatiotemporal dynamics to improve overall performance.
In particular, we evaluate our method on the large-scale multi-modal YouTube-8M v2 dataset, outperforming all other methods in the YouTube-8M Large-Scale Video Understanding challenge.
In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.
Despite the recent success of end-to-end learned representations, hand-crafted optical flow features are still widely used in video analysis tasks.
To understand the world, we humans constantly need to relate the present to the past, and put events in context.
This article describes the final solution of team monkeytyping, who finished in second place in the YouTube-8M video understanding challenge.
Our representation flow layer is a fully-differentiable layer designed to capture the "flow" of any representation channel within a convolutional neural network for action recognition.
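To make the idea concrete, here is a minimal NumPy sketch of a single gradient-based flow update between two feature maps. This is an illustrative simplification, not the paper's layer: the actual representation flow layer iterates TV-L1-style updates with learnable parameters inside the network, whereas `flow_update_step` below is a hypothetical one-shot normal-flow estimate.

```python
import numpy as np

def flow_update_step(f1, f2, eps=1e-6):
    """One gradient-based flow update between two 2-D feature maps.

    A simplified, NumPy-only sketch (assumption, not the published
    layer): estimates per-pixel motion along the local gradient
    direction, as in classical optical-flow brightness constancy.
    """
    # Spatial gradients of the second feature map (rows, cols).
    fy, fx = np.gradient(f2)
    # Temporal difference between the two maps.
    ft = f2 - f1
    denom = fx**2 + fy**2 + eps
    # Normal-flow estimate: displacement along the gradient.
    u = -ft * fx / denom  # horizontal flow component
    v = -ft * fy / denom  # vertical flow component
    return u, v

# Usage: a horizontal ramp shifted right by one pixel yields
# a horizontal flow of roughly one pixel everywhere.
f1 = np.tile(np.arange(10.0), (10, 1))
f2 = f1 - 1.0  # ramp translated one pixel to the right
u, v = flow_update_step(f1, f2)
```

In the paper's setting, an update like this would be applied iteratively to intermediate CNN feature channels, with the resulting flow fields passed on as features for action recognition.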