Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions

24 Jul 2022 · Zhi Li, Lu He, Huijuan Xu

Action understanding has evolved into the era of fine granularity, as most human behaviors in real life have only minor differences. To detect these fine-grained actions accurately in a label-efficient way, we tackle the problem of weakly-supervised fine-grained temporal action detection in videos for the first time. Without careful design to capture the subtle differences between fine-grained actions, previous weakly-supervised models for general action detection cannot perform well in the fine-grained setting. We propose to model actions as combinations of reusable atomic actions, which are automatically discovered from data through self-supervised clustering, in order to capture the commonality and individuality of fine-grained actions. The learnt atomic actions, represented by visual concepts, are further mapped to fine and coarse action labels by leveraging the semantic label hierarchy. Our approach constructs a visual representation hierarchy of four levels: clip level, atomic action level, fine action class level and coarse action class level, with supervision at each level. Extensive experiments on two large-scale fine-grained video datasets, FineAction and FineGym, show the benefit of the proposed weakly-supervised model for fine-grained action detection, which achieves state-of-the-art results.
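To make the idea of discovering atomic actions by clustering concrete, the sketch below groups clip-level features into a small set of clusters and summarizes a video as a histogram over them. It is only an illustration under stated assumptions: plain k-means stands in for the self-supervised clustering mentioned in the abstract, the clip features are made-up 2-D vectors, and the names `kmeans`, `assign_clips`, and `atom_histogram` are hypothetical rather than taken from the paper.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_clips(clips, centroids):
    """Return, for each clip feature, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(clip, centroids[k]))
        for clip in clips
    ]


def kmeans(clips, num_atoms, iterations=20, seed=0):
    """Plain k-means; each centroid plays the role of one 'atomic action' (assumption)."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(clips, num_atoms)]
    for _ in range(iterations):
        labels = assign_clips(clips, centroids)
        for k in range(num_atoms):
            members = [c for c, label in zip(clips, labels) if label == k]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def atom_histogram(labels, num_atoms):
    """Video-level summary: fraction of clips assigned to each atomic action."""
    counts = [0] * num_atoms
    for label in labels:
        counts[label] += 1
    return [count / len(labels) for count in counts]


# Toy clip features for one video (hypothetical values, two obvious groups).
clips = [[0.0, 0.0], [0.2, 0.0], [10.0, 0.0], [10.2, 0.0]]
centroids = kmeans(clips, num_atoms=2)
labels = assign_clips(clips, centroids)
histogram = atom_histogram(labels, num_atoms=2)
```

In a hierarchy like the one described above, a video-level summary of this kind could then feed classifiers for the fine and coarse labels; how the paper actually performs that mapping is not specified here.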

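Task                                   Dataset     Model  Metric Name   Metric Value  Global Rank
Weakly Supervised Action Localization  FineAction  HAAN   mAP           4.10          # 1
Weakly Supervised Action Localization  FineAction  HAAN   mAP IOU@0.5   7.05          # 1
Weakly Supervised Action Localization  FineAction  HAAN   mAP IOU@0.75  3.95          # 1
Weakly Supervised Action Localization  FineAction  HAAN   mAP IOU@0.95  1.14          # 2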
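The IOU thresholds in the metric names refer to temporal intersection over union between a predicted segment and a ground-truth segment: a prediction is usually counted as correct at a given threshold only if its overlap reaches that value. The following is a minimal sketch of that overlap computation, assuming segments are given as (start, end) pairs in seconds; the function name `temporal_iou` is illustrative, and the benchmark's full evaluation protocol is not reproduced here.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments; 0.0 if they do not overlap."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred, gt, threshold):
    """True if the prediction overlaps the ground truth enough at this threshold."""
    return temporal_iou(pred, gt) >= threshold
```

For example, a prediction of (0.0, 10.0) against a ground truth of (5.0, 15.0) overlaps for 5 seconds out of a 15-second union, so it would not count as a match at a threshold of 0.5. This illustrates why scores drop sharply as the threshold moves from 0.5 to 0.95.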

Methods