MotionSqueeze: Neural Motion Feature Learning for Video Understanding

ECCV 2020 · Heeseung Kwon, Manjin Kim, Suha Kwak, Minsu Cho

Motion plays a crucial role in understanding videos and most state-of-the-art neural models for video classification incorporate motion information typically using optical flows extracted by a separate off-the-shelf method. As the frame-by-frame optical flows require heavy computation, incorporating motion information has remained a major computational bottleneck for video understanding. In this work, we replace external and heavy computation of optical flows with internal and light-weight learning of motion features. We propose a trainable neural module, dubbed MotionSqueeze, for effective motion feature extraction. Inserted in the middle of any neural network, it learns to establish correspondences across frames and convert them into motion features, which are readily fed to the next downstream layer for better prediction. We demonstrate that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost, outperforming the state of the art on Something-Something-V1&V2 datasets.
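
The idea of establishing correspondences across frames and converting them into motion features can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes one common recipe, a local dot-product correlation between the feature maps of two adjacent frames followed by a soft-argmax over candidate displacements. The function names `local_correlation` and `soft_argmax_displacement`, the `radius` search window, and the `temperature` value are all hypothetical choices made for illustration.

```python
import numpy as np


def local_correlation(feat_a, feat_b, radius):
    """Dot-product similarity between each position of feat_a and its
    (2*radius+1)^2 neighbours in feat_b. Both inputs have shape (C, H, W).
    Returns an array of shape (K, K, H, W) with K = 2*radius + 1."""
    _, h, w = feat_a.shape
    k = 2 * radius + 1
    # Zero-pad feat_b so every displacement inside the window is defined.
    padded = np.pad(feat_b, ((0, 0), (radius, radius), (radius, radius)))
    corr = np.empty((k, k, h, w), dtype=feat_a.dtype)
    for i in range(k):          # vertical offset dy = i - radius
        for j in range(k):      # horizontal offset dx = j - radius
            shifted = padded[:, i:i + h, j:j + w]
            corr[i, j] = (feat_a * shifted).sum(axis=0)
    return corr


def soft_argmax_displacement(corr, radius, temperature=0.01):
    """Turn a correlation volume into a (2, H, W) displacement map (dx, dy)
    by taking the softmax-weighted average of the candidate offsets."""
    k, _, h, w = corr.shape
    logits = corr.reshape(k * k, h, w) / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)
    offsets = np.arange(-radius, radius + 1)
    dy, dx = np.meshgrid(offsets, offsets, indexing="ij")
    flow_y = (weights * dy.reshape(-1, 1, 1)).sum(axis=0)
    flow_x = (weights * dx.reshape(-1, 1, 1)).sum(axis=0)
    return np.stack([flow_x, flow_y])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_a = rng.standard_normal((64, 12, 12))
    frame_a /= np.linalg.norm(frame_a, axis=0, keepdims=True)
    # Synthetic "next frame": content moved 1 pixel down and 2 pixels right.
    frame_b = np.roll(frame_a, shift=(1, 2), axis=(1, 2))
    flow = soft_argmax_displacement(local_correlation(frame_a, frame_b, 3), 3)
    print(flow[:, 6, 6])  # estimated (dx, dy) at one interior position
```

In a network, a displacement map like this (possibly alongside a confidence map) would be passed through further layers to produce the motion features that are merged back into the backbone; that step is omitted here.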

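| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | HMDB-51 | MSNet-R50 (16 frames, ImageNet pretrained) | Average accuracy of 3 splits | 77.4 | #31 |
| Action Classification | Kinetics-400 | MSNet-R50 (16 frames, ImageNet pretrained) | Acc@1 | 76.4 | #141 |
| Action Recognition | Something-Something V1 | MSNet-R50 (8 frames, ImageNet pretrained) | Top 1 Accuracy | 50.9 | #48 |
| Action Recognition | Something-Something V1 | MSNet-R50 (8 frames, ImageNet pretrained) | Top 5 Accuracy | 80.3 | #25 |
| Action Recognition | Something-Something V1 | MSNet-R50En (ensemble) | Top 1 Accuracy | 55.1 | #27 |
| Video Classification | Something-Something V1 | MSNet-R50En (ours) | Top-5 Accuracy | 84 | #1 |
| Action Recognition | Something-Something V1 | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | Top 1 Accuracy | 54.4 | #30 |
| Action Recognition | Something-Something V1 | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | Top 5 Accuracy | 83.8 | #13 |
| Action Recognition | Something-Something V1 | MSNet-R50 (16 frames, ImageNet pretrained) | Top 1 Accuracy | 52.1 | #43 |
| Action Recognition | Something-Something V1 | MSNet-R50 (16 frames, ImageNet pretrained) | Top 5 Accuracy | 82.3 | #18 |
| Action Recognition | Something-Something V2 | MSNet-R50 (16 frames, ImageNet pretrained) | Top-1 Accuracy | 64.7 | #92 |
| Action Recognition | Something-Something V2 | MSNet-R50 (16 frames, ImageNet pretrained) | Top-5 Accuracy | 89.4 | #70 |
| Action Recognition | Something-Something V2 | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | Top-1 Accuracy | 66.6 | #75 |
| Action Recognition | Something-Something V2 | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | Top-5 Accuracy | 90.6 | #51 |
| Video Classification | Something-Something V2 | MSNet-R50En (ours) | Top-5 Accuracy | 91 | #1 |
| Action Recognition | Something-Something V2 | MSNet-R50 (8 frames, ImageNet pretrained) | Top-1 Accuracy | 63 | #99 |
| Action Recognition | Something-Something V2 | MSNet-R50 (8 frames, ImageNet pretrained) | Top-5 Accuracy | 88.4 | #78 |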
