Low-Fidelity Video Encoder Optimization for Temporal Action Localization

Most existing temporal action localization (TAL) methods rely on a transfer learning pipeline: by first optimizing a video encoder on a large action classification dataset (i.e., source domain), followed by freezing the encoder and training a TAL head on the action localization dataset (i.e., target domain). This results in a task discrepancy problem for the video encoder – trained for action classification, but used for TAL. Intuitively, joint optimization with both the video encoder and TAL head is a strong baseline solution to this discrepancy. However, this is not operable for TAL subject to the GPU memory constraints, due to the prohibitive computational cost in processing long untrimmed videos. In this paper, we resolve this challenge by introducing a novel low-fidelity (LoFi) video encoder optimization method. Instead of always using the full training configurations in TAL learning, we propose to reduce the mini-batch composition in terms of temporal, spatial, or spatio-temporal resolution so that jointly optimizing the video encoder and TAL head becomes operable under the same memory conditions of a mid-range hardware budget. Crucially, this enables the gradients to flow backwards through the video encoder conditioned on a TAL supervision loss, favourably solving the task discrepancy problem and providing more effective feature representations. Extensive experiments show that the proposed LoFi optimization approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18 based video encoder in a single RGB stream, our method surpasses two-stream (RGB + optical-flow) ResNet50 based alternatives, often by a good margin.

PDF Abstract NeurIPS 2021 PDF NeurIPS 2021 Abstract

Results from the Paper


Task Dataset Model Metric Name Metric Value Global Rank Benchmark
Temporal Action Localization ActivityNet-1.3 LoFi+G-TAD mAP IOU@0.5 50.91 # 20
mAP 34.96 # 21
mAP IOU@0.75 35.86 # 13
mAP IOU@0.95 8.79 # 10
Temporal Action Localization HACS LoFi+G-TAD (RGB, RN18) Average-mAP 24.64 # 9
mAP@0.5 37.78 # 6
mAP@0.75 24.40 # 6
mAP@0.95 7.29 # 6

Methods


No methods listed for this paper. Add relevant methods here