Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

11 Mar 2023 · Teng Wang, Jinrui Zhang, Feng Zheng, Wenhao Jiang, Ran Cheng, Ping Luo

Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To remove this dependency, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively mines the alignments between multi-sentence descriptions and corresponding event segments. Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments, i.e., text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representation to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost to mitigate the sub-optimal matching results caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2 and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge. Our code is publicly available at https://github.com/zjr2000/GVL.
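To make the abstract's matching step concrete, below is a minimal, hypothetical sketch of a semantic-aware assignment between event proposals and ground-truth sentences: a Hungarian matching whose cost combines a temporal-localization term with a cross-modal distance term in a joint embedding space, in the spirit of the TEG objective. This is not the authors' released code; the helper names, cost weights (w_loc, w_sem), and embedding dimensions are assumptions for illustration.

```python
# Illustrative sketch only: semantic-aware proposal-to-sentence matching.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def temporal_iou(proposals, targets):
    """Pairwise temporal IoU between N proposals and M target segments.

    proposals: (N, 2) tensor of (start, end); targets: (M, 2) tensor.
    """
    inter_start = torch.max(proposals[:, None, 0], targets[None, :, 0])
    inter_end = torch.min(proposals[:, None, 1], targets[None, :, 1])
    inter = (inter_end - inter_start).clamp(min=0)
    union = (
        (proposals[:, 1] - proposals[:, 0])[:, None]
        + (targets[:, 1] - targets[:, 0])[None, :]
        - inter
    )
    return inter / union.clamp(min=1e-6)


def semantic_aware_match(prop_boxes, prop_embs, sent_boxes, sent_embs,
                         w_loc=1.0, w_sem=1.0):
    """Assign each ground-truth sentence to one event proposal.

    The cost penalizes both poor temporal overlap (localization) and large
    cross-modal distance between event and sentence embeddings, so a proposal
    whose annotated boundary is ambiguous can still be matched to the sentence
    it semantically describes.
    """
    loc_cost = -temporal_iou(prop_boxes, sent_boxes)                     # (N, M)
    sim = F.normalize(prop_embs, dim=-1) @ F.normalize(sent_embs, dim=-1).T
    sem_cost = -sim                                                      # (N, M)
    cost = w_loc * loc_cost + w_sem * sem_cost
    row_idx, col_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(row_idx.tolist(), col_idx.tolist()))


# Toy usage: 4 event proposals, 2 sentences (random embeddings).
props = torch.tensor([[0.0, 5.0], [4.0, 9.0], [10.0, 15.0], [2.0, 12.0]])
prop_embs = torch.randn(4, 256)
sents = torch.tensor([[0.5, 4.5], [9.5, 14.0]])
sent_embs = torch.randn(2, 256)
print(semantic_aware_match(props, prop_embs, sents, sent_embs))
```

With real (trained) embeddings, the semantic term keeps a proposal from being assigned purely on boundary overlap when a neighboring proposal better matches the sentence's content.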

Task | Dataset | Model | Metric | Value | Global Rank
Natural Language Moment Retrieval | ActivityNet Captions | GVL (paragraph-level) | R@1,IoU=0.5 | 60.67 | #1
Natural Language Moment Retrieval | ActivityNet Captions | GVL (paragraph-level) | R@1,IoU=0.7 | 38.55 | #1
Natural Language Moment Retrieval | ActivityNet Captions | GVL | R@1,IoU=0.5 | 49.18 | #2
Natural Language Moment Retrieval | ActivityNet Captions | GVL | R@1,IoU=0.7 | 29.69 | #5
Dense Video Captioning | ActivityNet Captions | GVL | METEOR | 10.03 | #4
Dense Video Captioning | ActivityNet Captions | GVL | CIDEr | 33.33 | #1
Dense Video Captioning | ActivityNet Captions | GVL | SODA | 7.11 | #1
Natural Language Moment Retrieval | TACoS | GVL (paragraph-level) | R@1,IoU=0.3 | 48.29 | #4
Natural Language Moment Retrieval | TACoS | GVL (paragraph-level) | R@1,IoU=0.5 | 36.07 | #4
Natural Language Moment Retrieval | TACoS | GVL | R@1,IoU=0.3 | 45.92 | #5
Natural Language Moment Retrieval | TACoS | GVL | R@1,IoU=0.5 | 34.57 | #6
Dense Video Captioning | YouCook2 | GVL | METEOR | 5.01 | #4
Dense Video Captioning | YouCook2 | GVL | CIDEr | 26.52 | #4
Dense Video Captioning | YouCook2 | GVL | SODA | 4.91 | #4

Methods

No methods listed for this paper.