Attention Distillation for Learning Video Representations

5 Apr 2019 · Miao Liu, Xin Chen, Yun Zhang, Yin Li, James M. Rehg

We address the challenging problem of learning motion representations using deep models for video recognition. To this end, we use attention modules that learn to highlight regions in the video and aggregate features for recognition. Specifically, we propose to leverage output attention maps as a vehicle to transfer the learned representation from a motion (flow) network to an RGB network. We systematically study the design of attention modules and develop a novel method for attention distillation. Our method is evaluated on major action recognition benchmarks and consistently improves the performance of the baseline RGB network by a significant margin. Moreover, we demonstrate that by leveraging motion cues during learning, our attention maps can identify the locations of actions in video frames. We believe our method provides a step towards learning motion-aware representations in deep models. Our project page is available at https://aptx4869lm.github.io/AttentionDistillation/
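The abstract does not spell out the distillation objective. The model name "Prob-Distill" in the results table below suggests that attention maps are treated as spatial probability distributions and matched with a divergence loss. The following is a minimal PyTorch sketch under that assumption; the names `rgb_net`, `flow_net`, and `attention_distillation_loss`, and the choice of a KL objective, are illustrative and not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    """KL divergence between spatial attention maps treated as
    probability distributions over locations (hypothetical sketch).

    student_attn, teacher_attn: (B, H, W) non-negative attention maps,
    e.g. from an RGB (student) and a flow (teacher) network.
    """
    b = student_attn.size(0)
    # Flatten the spatial dimensions and normalize each map so it
    # sums to 1, i.e. a distribution over frame locations.
    s = student_attn.reshape(b, -1)
    t = teacher_attn.reshape(b, -1)
    s = s / (s.sum(dim=1, keepdim=True) + eps)
    t = t / (t.sum(dim=1, keepdim=True) + eps)
    # KL(teacher || student), averaged over the batch.
    return (t * (torch.log(t + eps) - torch.log(s + eps))).sum(dim=1).mean()

# Usage sketch: combine with the standard classification loss on the
# RGB stream, keeping the flow (teacher) network frozen.
# logits, rgb_attn = rgb_net(frames)        # hypothetical student network
# with torch.no_grad():
#     _, flow_attn = flow_net(flows)        # hypothetical teacher network
# loss = F.cross_entropy(logits, labels) \
#        + lam * attention_distillation_loss(rgb_attn, flow_attn)
```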


Results from the Paper


Task               | Dataset                | Model        | Metric                       | Value | Global Rank
Action Recognition | HMDB-51                | Prob-Distill | Average accuracy of 3 splits | 72.0  | #48
Action Recognition | Something-Something V2 | Prob-Distill | Top-1 Accuracy               | 49.9  | #116
Action Recognition | Something-Something V2 | Prob-Distill | Top-5 Accuracy               | 79.1  | #85
Action Recognition | UCF101                 | Prob-Distill | 3-fold Accuracy              | 95.7  | #39

Methods


No methods listed for this paper.