TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Moment Retrieval	Charades-STA	UMT (VA)	R@1 IoU=0.5	48.31	# 15
Moment Retrieval	Charades-STA	UMT (VA)	R@1 IoU=0.7	29.25	# 14
Moment Retrieval	Charades-STA	UMT (VA)	R@5 IoU=0.5	88.79	# 3
Moment Retrieval	Charades-STA	UMT (VA)	R@5 IoU=0.7	56.08	# 4
Moment Retrieval	Charades-STA	UMT (VO)	R@1 IoU=0.5	49.35	# 14
Moment Retrieval	Charades-STA	UMT (VO)	R@1 IoU=0.7	26.16	# 16
Moment Retrieval	Charades-STA	UMT (VO)	R@5 IoU=0.5	89.41	# 2
Moment Retrieval	Charades-STA	UMT (VO)	R@5 IoU=0.7	54.95	# 6
Moment Retrieval	QVHighlights	UMT (w/ audio + PT ASR Captions)	mAP	38.08	# 16
Video Grounding	QVHighlights	UMT	R@1,IoU=0.5	56.23	# 5
Video Grounding	QVHighlights	UMT	R@1,IoU=0.7	41.18	# 5
Moment Retrieval	QVHighlights	UMT (w/ audio)	mAP	36.12	# 18
Highlight Detection	QVHighlights	UMT (w. PT)	mAP	39.12	# 6
Highlight Detection	QVHighlights	UMT	mAP	38.18	# 11
Highlight Detection	TVSum	UMT	mAP	83.1	# 5
Highlight Detection	YouTube Highlights	UMT	mAP	74.9	# 3
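
The moment retrieval rows above report recall at rank K under a temporal IoU threshold (for example, R@1 IoU=0.5). The following is a minimal sketch of how such a metric is commonly computed, assuming each query has a single ground-truth span, predictions are already sorted by confidence, and a hit is counted when IoU is greater than or equal to the threshold. The function names `temporal_iou` and `recall_at_k` are illustrative and are not taken from the UMT codebase; the official evaluation scripts may differ in details.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start, end) of a moment, in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: Sequence[Sequence[Span]],
    gts: Sequence[Span],
    k: int,
    iou_thr: float,
) -> float:
    """Percentage of queries whose top-k predictions contain at least one
    span with IoU >= iou_thr against the ground truth (assumed convention)."""
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= iou_thr for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(gts)


# Toy data (hypothetical): two queries, two ranked predictions each.
preds = [[(0.0, 10.0), (5.0, 15.0)], [(20.0, 30.0), (0.0, 4.0)]]
gts = [(4.0, 14.0), (0.0, 5.0)]
print(recall_at_k(preds, gts, k=1, iou_thr=0.5))  # 0.0: neither top-1 span reaches IoU 0.5
print(recall_at_k(preds, gts, k=5, iou_thr=0.7))  # 100.0: both queries are hit within top-5
```

Under this definition R@5 can never be lower than R@1 at the same threshold, which is consistent with the gap between the R@1 and R@5 rows in the table.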

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/umt-unified-multi-modal-transformers-for/highlight-detection-on-youtube-highlights)](https://paperswithcode.com/sota/highlight-detection-on-youtube-highlights?p=umt-unified-multi-modal-transformers-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/umt-unified-multi-modal-transformers-for/video-grounding-on-qvhighlights)](https://paperswithcode.com/sota/video-grounding-on-qvhighlights?p=umt-unified-multi-modal-transformers-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/umt-unified-multi-modal-transformers-for/highlight-detection-on-tvsum)](https://paperswithcode.com/sota/highlight-detection-on-tvsum?p=umt-unified-multi-modal-transformers-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/umt-unified-multi-modal-transformers-for/highlight-detection-on-qvhighlights)](https://paperswithcode.com/sota/highlight-detection-on-qvhighlights?p=umt-unified-multi-modal-transformers-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/umt-unified-multi-modal-transformers-for/moment-retrieval-on-charades-sta)](https://paperswithcode.com/sota/moment-retrieval-on-charades-sta?p=umt-unified-multi-modal-transformers-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/umt-unified-multi-modal-transformers-for/moment-retrieval-on-qvhighlights)](https://paperswithcode.com/sota/moment-retrieval-on-qvhighlights?p=umt-unified-multi-modal-transformers-for)`

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

CVPR 2022 · Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, XiaoHu Qie

Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable common need in the current era of exploding video content. Nevertheless, jointly conducting moment retrieval and highlight detection is an emerging research topic, even though its component problems and some related tasks have been studied for a while. In this paper, we present the first unified framework, named Unified Multi-modal Transformers (UMT), capable of realizing such joint optimization while also being easily reduced to solve the individual problems. As far as we are aware, this is the first scheme to integrate multi-modal (visual-audio) learning for either joint optimization or the individual moment retrieval task, and it tackles moment retrieval as a keypoint detection problem using a novel query generator and query decoder. Extensive comparisons with existing methods and ablation studies on the QVHighlights, Charades-STA, YouTube Highlights, and TVSum datasets demonstrate the effectiveness, superiority, and flexibility of the proposed method under various settings. Source code and pre-trained models are available at https://github.com/TencentARC/UMT.

Code

tencentarc/umt official

176

Tasks

Highlight Detection

Moment Retrieval

Natural Language Queries

Retrieval

Video Grounding

Datasets

Charades-STA

TVSum

QVHighlights

Results from the Paper

Ranked #3 on Highlight Detection on YouTube Highlights
