593 papers with code • 34 benchmarks • 84 datasets
Human action recognition has become an active research area in recent years, as it plays a significant role in video understanding. In general, human actions can be recognized from multiple modalities, such as appearance, depth, optical flow, and body skeletons.
In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset will give a similar boost in performance when applied to a different temporal task or dataset. The challenges of building video datasets have meant that most popular benchmarks for action recognition are small, on the order of 10k videos.
The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels.
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks.
Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision problems.
Another contribution is a study of good practices for learning ConvNets on video data within the temporal segment network framework.
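The temporal segment network idea can be sketched as follows: split a video into a few equal-length segments, sample one snippet from each, and fuse the per-snippet predictions with a consensus function (averaging is the simplest choice). This is a minimal single-stream illustration, not the paper's full two-stream pipeline; the function name and score layout are assumptions.

```python
import numpy as np

def segment_consensus(frame_scores, num_segments=3, seed=0):
    """Sketch of temporal-segment-style sampling: divide the video into
    `num_segments` equal segments, draw one per-frame score from each,
    and average them (the consensus function).
    frame_scores: (T, num_classes) array of per-frame class scores."""
    rng = np.random.default_rng(seed)
    T = frame_scores.shape[0]
    # segment boundaries: num_segments equal slices of [0, T)
    bounds = np.linspace(0, T, num_segments + 1).astype(int)
    sampled = [frame_scores[rng.integers(lo, hi)]
               for lo, hi in zip(bounds[:-1], bounds[1:])]
    return np.mean(sampled, axis=0)

scores = np.random.rand(30, 5)          # 30 frames, 5 classes
print(segment_consensus(scores).shape)  # (5,)
```

Sparse segment sampling lets the network see the whole video at low cost, instead of densely sampling consecutive frames from one short clip.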
We propose a simple yet effective approach for spatiotemporal feature learning using deep three-dimensional convolutional networks (3D ConvNets) trained on a large-scale supervised video dataset.
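The core operation behind 3D ConvNets is a convolution whose kernel slides over time as well as space, so a single filter responds to motion patterns. A minimal single-channel sketch (cross-correlation, as deep learning libraries implement it; shapes and names are illustrative assumptions):

```python
import numpy as np

def conv3d(video, kernel):
    """Valid 3D convolution of a single-channel video with one
    spatio-temporal kernel.
    video:  (T, H, W) array of frames; kernel: (kt, kh, kw) array."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.empty((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # inner product of the kernel with a (kt, kh, kw) window
                out[t, i, j] = np.sum(
                    video[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out

video = np.random.rand(8, 16, 16)   # 8 frames of 16x16 pixels
kernel = np.random.rand(3, 3, 3)    # 3x3x3 spatio-temporal kernel
print(conv3d(video, kernel).shape)  # (6, 14, 14)
```

Because the kernel has a temporal extent (here 3 frames), each output value mixes information across time, unlike a 2D convolution applied frame by frame.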
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition.