Multimodal Pretraining for Dense Video Captioning

Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.
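The paper's exact pretraining and finetuning recipe is not reproduced here, but the abstract describes multimodal sequence-to-sequence models that consume video signals plus caption-like text and generate time-stamped step captions. Below is a minimal, hypothetical sketch of that kind of model: a small PyTorch Transformer encoder-decoder over precomputed video-segment features concatenated with ASR tokens, trained with a caption-generation loss. All layer sizes, feature dimensions, and names are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (assumed, not the authors' implementation) of a multimodal
# seq2seq captioning model: video-segment features + ASR tokens -> caption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalSeq2Seq(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, video_dim=1024):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)    # project frame features
        self.token_emb = nn.Embedding(vocab_size, d_model) # ASR / caption tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, asr_tokens, caption_in):
        # Encoder input: concatenated video frames and ASR transcript tokens.
        src = torch.cat([self.video_proj(video_feats),
                         self.token_emb(asr_tokens)], dim=1)
        tgt = self.token_emb(caption_in)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(out)  # per-position vocabulary logits

# Toy forward/backward pass with random data (dimensions are arbitrary).
model = MultimodalSeq2Seq()
video = torch.randn(2, 8, 1024)             # 2 segments, 8 frame features each
asr = torch.randint(0, 10000, (2, 20))      # ASR transcript tokens
cap_in = torch.randint(0, 10000, (2, 12))   # caption tokens (shifted right)
cap_out = torch.randint(0, 10000, (2, 12))  # caption targets
logits = model(video, asr, cap_in)
loss = F.cross_entropy(logits.reshape(-1, 10000), cap_out.reshape(-1))
loss.backward()
```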

Published at the Asian Chapter of the ACL (AACL) 2020.

Datasets

YouCook2
Video Timeline Tags (ViTT)
Results from the Paper


 Ranked #1 on Dense Video Captioning on YouCook2 (ROUGE-L metric, using extra training data)

Task                     Dataset    Model                    Metric    Value   Global Rank   Uses Extra Training Data
Video Captioning         YouCook2   E2vidD6-MASSvid-BiD      BLEU-4    12.04   #6
Video Captioning         YouCook2   E2vidD6-MASSvid-BiD      METEOR    18.32   #4
Video Captioning         YouCook2   E2vidD6-MASSvid-BiD      ROUGE-L   39.03   #5
Video Captioning         YouCook2   E2vidD6-MASSvid-BiD      CIDEr      1.22   #9
Dense Video Captioning   YouCook2   E2vidD6-MASSalign-BiD    ROUGE-L   39.03   #1            Yes
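The #1 ranking above is reported on ROUGE-L. As a reference for how that metric is defined, here is a small from-scratch sketch of ROUGE-L (the longest-common-subsequence F-measure between a candidate caption and a reference). Leaderboard numbers are normally produced with standard toolkits such as pycocoevalcap, whose tokenization and aggregation can differ, so treat this only as an illustration of the formula.

```python
# ROUGE-L sketch: LCS-based precision/recall combined into an F-measure.
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over tokens.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    # F = (1 + beta^2) * P * R / (R + beta^2 * P)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)

# Toy example with made-up cooking-step captions.
print(rouge_l("add the chopped onions to the pan",
              "add chopped onions into the hot pan"))
```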

Methods


No methods listed for this paper.