$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

31 Mar 2024  ·  Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, Chang Wen Chen

Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information at different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning ($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight $R^2$ Block containing only 1.5% of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, the $R^2$ Block recurrently aggregates spatial features from earlier layers, then refines temporal correlations conditioned on the given query, resulting in a coarse-to-fine scheme. $R^2$-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without additional temporal backbones, demonstrating the significance and effectiveness of the proposed scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.
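Below is a minimal, illustrative sketch of the reversed recurrent aggregation idea described above: starting from CLIP's last layer, a lightweight block walks back through earlier layers, folding each one into a running state before refining temporal correlations across frames. This is not the authors' implementation (see the linked repository for that); the module name, dimensions, gating scheme, and the omission of explicit query conditioning are all assumptions made for brevity.

```python
# A toy sketch of reversed recurrent aggregation over per-layer CLIP features.
# Names (R2BlockSketch, hidden_dim, proj, gate) are illustrative, not from the paper's code.
import torch
import torch.nn as nn


class R2BlockSketch(nn.Module):
    """Aggregates multi-layer CLIP frame features from the last layer backward."""

    def __init__(self, clip_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(clip_dim, hidden_dim)          # per-layer projection
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)    # fuse state with a layer
        self.temporal = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)

    def forward(self, layer_feats: list[torch.Tensor]) -> torch.Tensor:
        # layer_feats: one (batch, num_frames, clip_dim) tensor per CLIP layer,
        # ordered shallow -> deep.
        state = self.proj(layer_feats[-1])                   # start from the last layer
        for feats in reversed(layer_feats[:-1]):             # recur toward earlier layers
            fused = torch.cat([state, self.proj(feats)], dim=-1)
            state = torch.tanh(self.gate(fused))             # recurrent spatial update
        # Refine temporal correlations across frames (query conditioning omitted here).
        refined, _ = self.temporal(state, state, state)
        return refined


if __name__ == "__main__":
    feats = [torch.randn(2, 32, 768) for _ in range(12)]     # e.g., 12 CLIP layers
    print(R2BlockSketch()(feats).shape)                      # torch.Size([2, 32, 256])
```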

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Highlight Detection | QVHighlights | R^2-Tuning | mAP | 40.75 | #2 |
| Highlight Detection | QVHighlights | R^2-Tuning | Hit@1 | 64.20 | #5 |
| Moment Retrieval | QVHighlights | R^2-Tuning | mAP | 46.17 | #4 |
| Moment Retrieval | QVHighlights | R^2-Tuning | R@1 IoU=0.5 | 68.03 | #2 |
| Moment Retrieval | QVHighlights | R^2-Tuning | R@1 IoU=0.7 | 49.35 | #4 |
| Moment Retrieval | QVHighlights | R^2-Tuning | mAP@0.5 | 69.04 | #2 |
| Moment Retrieval | QVHighlights | R^2-Tuning | mAP@0.75 | 47.56 | #3 |