TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Weakly Supervised Action Localization	FineAction	HAAN	mAP	4.10	#1
Weakly Supervised Action Localization	FineAction	HAAN	mAP IoU@0.5	7.05	#1
Weakly Supervised Action Localization	FineAction	HAAN	mAP IoU@0.75	3.95	#1
Weakly Supervised Action Localization	FineAction	HAAN	mAP IoU@0.95	1.14	#2
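The IoU thresholds in the metric names refer to temporal intersection-over-union (tIoU) between a predicted segment and a ground-truth segment. At IoU@0.5, a detection generally counts as correct only if its tIoU with a ground-truth instance of the same class is at least 0.5, so higher thresholds demand tighter boundaries, which is consistent with the scores falling from 7.05 to 1.14. The unqualified mAP row most likely averages over several thresholds, but the table does not say which. The sketch below only computes tIoU for segments given as (start, end) pairs; the function names are illustrative, and the full AP computation (ranking by confidence, one-to-one matching, per-class averaging) belongs to the benchmark's own evaluation code, which is not reproduced here.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) pairs."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Length of the overlapping interval (zero if the segments are disjoint).
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold):
    """Illustrative matching rule: the prediction hits gt at the given IoU threshold."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    pred, gt = (2.0, 6.0), (3.0, 7.0)  # overlap 3.0, union 5.0 -> tIoU 0.6
    for t in (0.5, 0.75, 0.95):
        print(f"IoU@{t}: {is_true_positive(pred, gt, t)}")
```

With these toy segments the prediction counts as a hit at IoU@0.5 but as a miss at IoU@0.75 and IoU@0.95.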

Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions

24 Jul 2022 · Zhi Li, Lu He, Huijuan Xu

Action understanding has evolved into the era of fine granularity, as most human behaviors in real life differ only in minor ways. To detect these fine-grained actions accurately in a label-efficient way, we tackle the problem of weakly-supervised fine-grained temporal action detection in videos for the first time. Lacking a design that captures the subtle differences between fine-grained actions, previous weakly-supervised models for general action detection do not perform well in the fine-grained setting. We propose to model actions as combinations of reusable atomic actions, which are automatically discovered from data through self-supervised clustering, in order to capture both the commonality and the individuality of fine-grained actions. The learnt atomic actions, represented by visual concepts, are further mapped to fine and coarse action labels by leveraging the semantic label hierarchy. Our approach constructs a four-level visual representation hierarchy (clip level, atomic action level, fine action class level, and coarse action class level) with supervision at each level. Extensive experiments on two large-scale fine-grained video datasets, FineAction and FineGym, show the benefit of our proposed weakly-supervised model for fine-grained action detection, where it achieves state-of-the-art results.
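The abstract describes the clip, atomic action, fine class, and coarse class hierarchy only at a high level. The toy sketch below illustrates that flow under our own assumptions and is not the authors' HAAN implementation. Atomic actions are discovered by plain k-means over clip features (the abstract says only "self-supervised clustering"). Each clip is softly assigned to atomic actions, and a hypothetical weight matrix maps atomic actions to fine classes. A coarse class scores as the max over its fine children, and video-level scores use top-k mean pooling, a common choice when only video-level labels are available. All function names, the temperature, the weights, and the hierarchy are illustrative.

```python
import math
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Lloyd's k-means with farthest-first initialisation; centroids stand in for atomic actions."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster empties out
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def soft_assign(point, centroids, temperature=1.0):
    """Softmax over negative squared distances: one clip's atomic-action distribution."""
    logits = [-squared_distance(point, c) / temperature for c in centroids]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fine_scores(atomic_probs, weights):
    """Map atomic-action probabilities to fine-class scores via a k x F weight matrix."""
    num_fine = len(weights[0])
    return [sum(atomic_probs[a] * weights[a][f] for a in range(len(atomic_probs)))
            for f in range(num_fine)]


def coarse_scores(fine, hierarchy):
    """Assumed rule: a coarse class scores as the max over its fine children."""
    return [max(fine[f] for f in children) for children in hierarchy]


def video_scores(per_clip, top_k):
    """Top-k mean pooling over clips gives the video-level score matched to weak labels."""
    pooled = []
    for c in range(len(per_clip[0])):
        column = sorted((clip[c] for clip in per_clip), reverse=True)
        pooled.append(sum(column[:top_k]) / top_k)
    return pooled


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy 2-D "clip features" drawn from two well-separated blobs.
    clips = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(6)]
    clips += [[rng.gauss(3, 0.1), rng.gauss(3, 0.1)] for _ in range(6)]
    centroids = kmeans(clips, k=2)
    weights = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]  # hypothetical atomic -> fine mapping
    hierarchy = [[0, 1], [2]]  # hypothetical: coarse 0 = fine {0, 1}, coarse 1 = fine {2}
    fine = [fine_scores(soft_assign(c, centroids), weights) for c in clips]
    coarse = [coarse_scores(f, hierarchy) for f in fine]
    print("video-level fine scores:", video_scores(fine, top_k=3))
    print("video-level coarse scores:", video_scores(coarse, top_k=3))
```

Because atomic actions are shared across fine classes (here, both atomic actions feed fine class 1), the sketch mirrors the "commonality and individuality" idea in the abstract; swapping the max rule or the top-k pooling for other aggregations would change the behavior, and the abstract does not state which the authors use.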
