STM: SpatioTemporal and Motion Encoding for Action Recognition

Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and a separate flow stream to learn motion features. In this work, we aim to efficiently encode both kinds of features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that, by encoding spatiotemporal and motion features together, the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51).
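To make the two-branch idea concrete, here is a minimal pure-Python sketch. It is not the authors' implementation, and the abstract does not specify these details: it assumes the spatiotemporal branch can be approximated by a per-channel temporal kernel of size 3, that motion can be approximated by differencing the features of consecutive frames, and that the two branches are fused by summation inside a residual connection. Spatial dimensions, the learned 2D convolutions, and channel reduction are omitted. The names `channelwise_temporal_conv`, `feature_difference`, and `stm_like_block` are hypothetical.

```python
from typing import List

# Features have shape [T][C]: T frames, C channels (spatial dims omitted for brevity).
Features = List[List[float]]


def channelwise_temporal_conv(x: Features, kernels: List[List[float]]) -> Features:
    """Apply a separate size-3 temporal kernel to each channel, zero-padded in time.

    Assumption: stands in for the temporal part of a channel-wise
    spatiotemporal module.
    """
    t_len, c_len = len(x), len(x[0])
    out = [[0.0] * c_len for _ in range(t_len)]
    for t in range(t_len):
        for c in range(c_len):
            for k, w in enumerate(kernels[c]):
                src = t + k - 1  # kernel taps cover t-1, t, t+1
                if 0 <= src < t_len:
                    out[t][c] += w * x[src][c]
    return out


def feature_difference(x: Features) -> Features:
    """Motion proxy: features of frame t+1 minus frame t; last step is zeros.

    Assumption: stands in for a channel-wise motion module.
    """
    t_len, c_len = len(x), len(x[0])
    out = [[x[t + 1][c] - x[t][c] for c in range(c_len)] for t in range(t_len - 1)]
    out.append([0.0] * c_len)  # keep the temporal length equal to the input
    return out


def stm_like_block(x: Features, kernels: List[List[float]]) -> Features:
    """Sum both branches and add the residual input (assumed fusion scheme)."""
    st = channelwise_temporal_conv(x, kernels)
    mo = feature_difference(x)
    return [
        [x[t][c] + st[t][c] + mo[t][c] for c in range(len(x[0]))]
        for t in range(len(x))
    ]
```

With three frames of two channels and identity temporal kernels, `stm_like_block([[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]], [[0.0, 1.0, 0.0]] * 2)` adds the frame-to-frame differences on top of twice the input, so a channel that changes faster over time contributes a larger motion term.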

ICCV 2019
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Action Recognition In Videos | HMDB-51 | STM (ImageNet+Kinetics pretrain) | Average accuracy of 3 splits | 72.2 | # 1
Action Recognition In Videos | Jester (Gesture Recognition) | STM (ResNet-50, 16 frames) | Val | 96.7 | # 1
Action Classification | Kinetics-400 | STM (ResNet-50) | Acc@1 | 73.7 | # 162
Action Recognition In Videos | Something-Something V1 | STM (16 frames, ImageNet pretraining) | Top 1 Accuracy | 50.7 | # 1
Action Recognition In Videos | Something-Something V2 | STM (16 frames, ImageNet pretraining) | Top-1 Accuracy | 64.2 | # 1
Action Recognition In Videos | Something-Something V2 | STM (16 frames, ImageNet pretraining) | Top-5 Accuracy | 89.8 | # 1
Action Recognition In Videos | UCF101 | STM (ImageNet+Kinetics pretrain) | 3-fold Accuracy | 96.2 | # 1

Methods