A Closer Look at Spatiotemporal Convolutions for Action Recognition

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block, "R(2+1)D", which gives rise to CNNs that achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101 and HMDB51.
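To make the factorization concrete, the sketch below counts the weights in one full 3D convolution and in a spatial convolution followed by a temporal convolution. The abstract does not say how wide the intermediate layer between the two should be. The rule used here, picking the largest intermediate width that does not exceed the 3D weight count, is an assumption made for illustration, and all function names are hypothetical.

```python
def conv3d_params(n_in, n_out, t=3, d=3):
    """Weights in one full 3D convolution with a t x d x d kernel (bias ignored)."""
    return n_in * n_out * t * d * d


def matched_mid_channels(n_in, n_out, t=3, d=3):
    """Assumed rule: largest intermediate width M whose factorized
    weight count stays at or below that of the full 3D convolution."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def factorized_params(n_in, n_out, n_mid, t=3, d=3):
    """Weights in a 1 x d x d spatial convolution (n_in -> n_mid)
    followed by a t x 1 x 1 temporal convolution (n_mid -> n_out)."""
    spatial = n_in * n_mid * d * d
    temporal = n_mid * n_out * t
    return spatial + temporal


# Example layer: 64 input channels, 64 output channels, 3 x 3 x 3 kernel.
n_in = n_out = 64
n_mid = matched_mid_channels(n_in, n_out)
full = conv3d_params(n_in, n_out)
split = factorized_params(n_in, n_out, n_mid)
print(n_mid, full, split)
```

For this 64-channel layer the two weight counts come out equal, so under this assumption any accuracy difference between the two designs cannot be attributed to model size. The factorized form also places an extra nonlinearity between the spatial and temporal steps, which the weight count does not capture.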

CVPR 2018

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | HMDB-51 | R[2+1]D-RGB (Sports1M pretrained) | Average accuracy of 3 splits | 66.6 | #58 |
| Action Recognition | HMDB-51 | R[2+1]D-TwoStream (Kinetics pretrained) | Average accuracy of 3 splits | 78.7 | #26 |
| Action Recognition | HMDB-51 | R[2+1]D-Flow (Kinetics pretrained) | Average accuracy of 3 splits | 76.4 | #35 |
| Action Recognition | HMDB-51 | R[2+1]D-TwoStream (Sports1M pretrained) | Average accuracy of 3 splits | 72.7 | #45 |
| Action Recognition | HMDB-51 | R[2+1]D-Flow (Sports1M pretrained) | Average accuracy of 3 splits | 70.1 | #54 |
| Action Classification | Kinetics-400 | R[2+1]D-RGB (Sports-1M pretrain) | Acc@1 | 74.3 | #153 |
| Action Classification | Kinetics-400 | R[2+1]D-RGB (Sports-1M pretrain) | Acc@5 | 91.4 | #111 |
| Action Classification | Kinetics-400 | R[2+1]D-Flow | Acc@1 | 67.5 | #175 |
| Action Classification | Kinetics-400 | R[2+1]D-Flow | Acc@5 | 87.2 | #125 |
| Action Recognition | Sports-1M | R[2+1]D-Flow-32frame | Clip Hit@1 | 46.4 | #3 |
| Action Recognition | Sports-1M | R[2+1]D-Flow-32frame | Video hit@1 | 68.4 | #6 |
| Action Recognition | Sports-1M | R[2+1]D-Flow-32frame | Video hit@5 | 88.7 | #6 |
| Action Recognition | UCF101 | R[2+1]D-TwoStream (Sports-1M pretrained) | 3-fold Accuracy | 95 | #46 |
| Action Recognition | UCF101 | R[2+1]D-Flow (Sports-1M pretrained) | 3-fold Accuracy | 93.3 | #58 |
| Action Recognition | UCF101 | R[2+1]D-RGB (Sports-1M pretrained) | 3-fold Accuracy | 93.6 | #56 |
| Action Recognition | UCF101 | R[2+1]D-TwoStream (Kinetics pretrained) | 3-fold Accuracy | 97.3 | #18 |
| Action Recognition | UCF101 | R[2+1]D-Flow (Kinetics pretrained) | 3-fold Accuracy | 95.5 | #41 |

Results from Other Papers


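| Task | Dataset | Model | Metric Name | Metric Value | Rank |
|---|---|---|---|---|---|
| Action Recognition | HMDB-51 | R[2+1]D-RGB (Kinetics pretrained) | Average accuracy of 3 splits | 74.5 | #40 |
| Action Classification | Kinetics-400 | R[2+1]D-Flow (Sports-1M pretrain) | Acc@1 | 75.4 | #146 |
| Action Classification | Kinetics-400 | R[2+1]D-Flow (Sports-1M pretrain) | Acc@5 | 91.9 | #109 |
| Action Recognition | UCF101 | R[2+1]D-RGB (Kinetics pretrained) | 3-fold Accuracy | 96.8 | #27 |
| Action Classification | Kinetics-400 | R[2+1]D | Acc@1 | 72 | #165 |
| Action Classification | Kinetics-400 | R[2+1]D | Acc@5 | 90 | #120 |
| Action Classification | Kinetics-400 | R[2+1]D-Two-Stream | Acc@1 | 73.9 | #154 |
| Action Classification | Kinetics-400 | R[2+1]D-Two-Stream | Acc@5 | 90.9 | #115 |
| Action Classification | Kinetics-400 | R[2+1]D-RGB | Acc@1 | 72 | #165 |
| Action Classification | Kinetics-400 | R[2+1]D-RGB | Acc@5 | 90 | #120 |
| Action Recognition | Sports-1M | R[2+1]D-Two-Stream-32frame | Video hit@1 | 73.3 | #3 |
| Action Recognition | Sports-1M | R[2+1]D-Two-Stream-32frame | Video hit@5 | 91.9 | #3 |
| Action Recognition | Sports-1M | R[2+1]D-RGB-32frame | Clip Hit@1 | 57 | #1 |
| Action Recognition | Sports-1M | R[2+1]D-RGB-32frame | Video hit@1 | 73 | #4 |
| Action Recognition | Sports-1M | R[2+1]D-RGB-32frame | Video hit@5 | 91.5 | #4 |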

Methods