The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels, with multiple labels per person occurring frequently.
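For concreteness, here is a minimal sketch of what one such spatio-temporally localized annotation might look like; the class and field names below are illustrative assumptions, not AVA's actual schema:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AVAAnnotation:
    """One spatio-temporally localized action label.
    Field names are illustrative, not AVA's actual CSV schema."""
    video_id: str              # clip identifier
    timestamp: float           # keyframe time within the 15-minute clip (seconds)
    bbox: Tuple[float, float, float, float]  # normalized person box (x1, y1, x2, y2)
    action_ids: List[int]      # several of the 80 atomic actions may apply at once

# Multiple labels per person are frequent, e.g. "stand" + "talk to" + "watch".
ann = AVAAnnotation("abc123", 902.0, (0.10, 0.20, 0.45, 0.95), [11, 64, 79])
```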
The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels.
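As a rough illustration of what a spatio-temporal 3D kernel means in practice, here is a minimal PyTorch sketch; the layer sizes are arbitrary choices for the example, not the paper's architecture:

```python
import torch
import torch.nn as nn

# A 3D convolution whose kernel spans time as well as height and width.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 7, 7),    # (time, height, width)
                   stride=(1, 2, 2), padding=(1, 3, 3))

clip = torch.randn(1, 3, 16, 112, 112)       # (batch, channels, frames, H, W)
out = conv3d(clip)
print(out.shape)                             # torch.Size([1, 64, 16, 56, 56])
```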
The dynamics of human body skeletons convey significant information for human action recognition.
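A minimal sketch of the skeleton input such methods operate on, assuming a sequence of 3D joint coordinates rather than raw pixels; the shapes are illustrative:

```python
import numpy as np

T, V, C = 300, 25, 3      # frames, joints (e.g. 25 for Kinect v2), coords (x, y, z)
skeleton = np.random.randn(T, V, C)

# Frame-to-frame joint displacement: a simple view of the "dynamics"
# that carry the action signal.
motion = skeleton[1:] - skeleton[:-1]        # shape (T-1, V, C)
```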
The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of the temporal segment network. Furthermore, based on temporal segment networks, we won the video classification track at the ActivityNet Challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
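A minimal sketch of the segment-based sampling and consensus idea behind TSN, assuming uniform segments and simple score averaging; the function names are illustrative, not the authors' code:

```python
import random

def sample_snippet_indices(num_frames: int, num_segments: int = 3):
    """Pick one frame index uniformly at random inside each equal-length segment."""
    seg_len = num_frames // num_segments
    return [i * seg_len + random.randrange(seg_len) for i in range(num_segments)]

def segmental_consensus(snippet_scores):
    """Average per-snippet class scores into a video-level prediction."""
    num_classes = len(snippet_scores[0])
    return [sum(s[c] for s in snippet_scores) / len(snippet_scores)
            for c in range(num_classes)]

# Sparse sampling: three snippets cover the whole 300-frame video.
print(sample_snippet_indices(num_frames=300, num_segments=3))
```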
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks.
However, for action recognition in videos, the improvement brought by deep convolutional networks is not as evident.
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition.
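One such form is the "(2+1)D" factorization, in which a full 3D convolution is decomposed into a 2D spatial convolution followed by a 1D temporal convolution. A minimal PyTorch sketch; the channel sizes are illustrative, not the paper's parameter-matched choice:

```python
import torch
import torch.nn as nn

# (2+1)D block: a 1xkxk spatial convolution, then a tx1x1 temporal convolution,
# in place of a single t x k x k 3D convolution.
spatial = nn.Conv3d(3, 45, kernel_size=(1, 3, 3), padding=(0, 1, 1))
temporal = nn.Conv3d(45, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0))

clip = torch.randn(1, 3, 16, 112, 112)       # (batch, channels, frames, H, W)
out = temporal(torch.relu(spatial(clip)))
print(out.shape)                             # torch.Size([1, 64, 16, 112, 112])
```

The extra nonlinearity between the two factored convolutions is part of what distinguishes this form from a plain 3D convolution with the same receptive field.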