TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Temporal Action Localization	ActivityNet-1.3	LoFi+G-TAD	mAP IOU@0.5	50.91	# 20
Temporal Action Localization	ActivityNet-1.3	LoFi+G-TAD	mAP	34.96	# 21
Temporal Action Localization	ActivityNet-1.3	LoFi+G-TAD	mAP IOU@0.75	35.86	# 13
Temporal Action Localization	ActivityNet-1.3	LoFi+G-TAD	mAP IOU@0.95	8.79	# 10
Temporal Action Localization	HACS	LoFi+G-TAD (RGB, RN18)	Average-mAP	24.64	# 9
Temporal Action Localization	HACS	LoFi+G-TAD (RGB, RN18)	mAP@0.5	37.78	# 6
Temporal Action Localization	HACS	LoFi+G-TAD (RGB, RN18)	mAP@0.75	24.40	# 6
Temporal Action Localization	HACS	LoFi+G-TAD (RGB, RN18)	mAP@0.95	7.29	# 6
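
The metric names above are mean average precision (mAP) values at temporal IoU thresholds. As a minimal sketch of the matching criterion behind them, the snippet below computes temporal IoU for two segments. The function name `temporal_iou` and the example segments are hypothetical and are not taken from the paper or from any benchmark toolkit. The unqualified "mAP" and "Average-mAP" rows are conventionally averaged over thresholds from 0.5 to 0.95 in steps of 0.05 on these benchmarks, although this page does not state that itself.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments, e.g. in seconds."""
    # Overlap is zero when the segments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical prediction and ground-truth segments.
iou = temporal_iou((0.0, 10.0), (5.0, 15.0))
# A prediction counts as a match at threshold t only if iou >= t.
matches = {t: iou >= t for t in (0.5, 0.75, 0.95)}
```

A prediction that overlaps the ground truth only loosely can count at the 0.5 threshold yet fail at 0.95, which is why the values in the table fall as the threshold rises.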

Low-Fidelity Video Encoder Optimization for Temporal Action Localization

NeurIPS 2021 · Mengmeng Xu, Juan Manuel Perez Rua, Xiatian Zhu, Bernard Ghanem, Brais Martinez

Most existing temporal action localization (TAL) methods rely on a transfer learning pipeline: a video encoder is first optimized on a large action classification dataset (i.e., the source domain), then frozen while a TAL head is trained on the action localization dataset (i.e., the target domain). This creates a task discrepancy problem for the video encoder, which is trained for action classification but used for TAL. Intuitively, jointly optimizing the video encoder and the TAL head is a strong baseline solution to this discrepancy. However, joint optimization is not feasible for TAL under GPU memory constraints, because processing long untrimmed videos is prohibitively expensive. In this paper, we resolve this challenge by introducing a novel low-fidelity (LoFi) video encoder optimization method. Instead of always using the full training configuration in TAL learning, we propose to reduce the temporal, spatial, or spatio-temporal resolution of the mini-batch so that jointly optimizing the video encoder and TAL head becomes feasible within the memory of a mid-range hardware budget. Crucially, this lets gradients flow backwards through the video encoder under a TAL supervision loss, which addresses the task discrepancy problem and yields more effective feature representations. Extensive experiments show that the proposed LoFi optimization approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18-based video encoder in a single RGB stream, our method surpasses two-stream (RGB + optical-flow) ResNet50-based alternatives, often by a good margin.
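
To make the memory argument concrete, here is a rough sketch of how the input volume of a clip shrinks when temporal resolution, spatial resolution, or both are reduced. Every number (256 frames, 224x224 pixels, halving factors) and the helper name `input_volume` are illustrative assumptions, not the configurations used in the paper. The sketch counts input values only; it relies on the rough assumption that activation memory in a convolutional encoder grows with the input volume, and it does not measure actual GPU usage.

```python
def input_volume(frames, height, width, channels=3):
    """Number of scalar values in one video clip tensor."""
    return frames * height * width * channels


# Hypothetical full-fidelity setting: 256 frames at 224x224 RGB.
full = input_volume(256, 224, 224)

# Hypothetical low-fidelity variants (halve frames, halve side length, or both).
variants = {
    "temporal": input_volume(128, 224, 224),
    "spatial": input_volume(256, 112, 112),
    "spatio-temporal": input_volume(128, 112, 112),
}

# Reduction factor of each variant relative to the full-fidelity clip.
reduction = {name: full // vol for name, vol in variants.items()}

for name, factor in reduction.items():
    print(f"{name}: {factor}x fewer input values per clip")
```

Under these assumed settings, halving the spatial side length saves more than halving the frame count because the spatial reduction applies to two dimensions at once.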

Code

No code implementations yet.

Tasks

Action Classification

Action Localization

Optical Flow Estimation

Temporal Action Localization

Transfer Learning

Datasets

Kinetics

ActivityNet

Kinetics 400

HACS

Results from the Paper

Ranked #9 on Temporal Action Localization on HACS

Methods

No methods listed for this paper.
