Two-Stream Convolutional Networks for Action Recognition in Videos

NeurIPS 2014  ·  Karen Simonyan, Andrew Zisserman

We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
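The core idea above — a spatial stream on still-frame appearance and a temporal stream on stacked optical flow, combined by late fusion of their class scores — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the logits, the 3-class setup, and the equal fusion weights are hypothetical (the paper fuses softmax scores by averaging and also reports an SVM-based fusion).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of class logits."""
    e = np.exp(x - x.max())
    return e / e.sum()

def two_stream_predict(spatial_logits, temporal_logits,
                       w_spatial=0.5, w_temporal=0.5):
    """Late fusion: weighted average of the two streams' softmax scores.

    Equal weights are an illustrative choice; the paper averages scores
    or trains an SVM on the stacked stream outputs.
    """
    return (w_spatial * softmax(spatial_logits)
            + w_temporal * softmax(temporal_logits))

# Hypothetical logits for a 3-class example
spatial = np.array([2.0, 0.5, -1.0])   # appearance stream (single RGB frame)
temporal = np.array([0.1, 3.0, -0.5])  # motion stream (stacked optical flow)

scores = two_stream_predict(spatial, temporal)
pred = int(np.argmax(scores))  # here the motion evidence dominates
```

Note the temporal stream's input in the paper is a stack of horizontal and vertical flow fields over L consecutive frames (2L input channels), which is what lets a ConvNet learn motion features despite limited video training data.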

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Classification | Charades | 2-Strm | MAP | 18.6 | #48 |
| Action Recognition | HMDB-51 | Two-Stream (ImageNet pretrained) | Average accuracy of 3 splits | 59.4 | #66 |
| Action Recognition | UCF101 | Two-Stream (ImageNet pretrained) | 3-fold Accuracy | 88.0 | #72 |
| Hand Gesture Recognition | VIVA Hand Gestures Dataset | Two Stream CNNs | Accuracy | 68 | #3 |
