Multimodal Pretraining for Dense Video Captioning

Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.

Venue: Asian Chapter 2020


Results from the Paper

 Ranked #1 on Dense Video Captioning on YouCook2 (ROUGE-L metric, using extra training data)

Task                    Dataset   Model                  Metric   Value  Global Rank
Dense Video Captioning  YouCook2  E2vidD6-MASSalign-BiD  ROUGE-L  39.03  #1
Video Captioning        YouCook2  E2vidD6-MASSvid-BiD    BLEU-4   12.04  #4
Video Captioning        YouCook2  E2vidD6-MASSvid-BiD    METEOR   18.32  #3
Video Captioning        YouCook2  E2vidD6-MASSvid-BiD    ROUGE-L  39.03  #3
Video Captioning        YouCook2  E2vidD6-MASSvid-BiD    CIDEr    1.22   #4
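
For readers unfamiliar with the headline metric, ROUGE-L scores a generated caption by the longest common subsequence (LCS) of tokens it shares with a reference caption. This page does not say how the reported scores were computed, so the following is only a minimal sketch under stated assumptions: lowercase whitespace tokenization, a single reference, and a balanced F1 combination of LCS precision and recall. The function names `lcs_length` and `rouge_l_f1` are hypothetical and are not taken from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming with a rolling row: prev[j] holds the LCS length
    # of the tokens of `a` seen so far against the first j tokens of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced ROUGE-L F-score in [0, 1] for one candidate/reference pair.

    Assumption: lowercase whitespace tokenization and beta = 1. Published
    evaluation scripts may tokenize and weight precision/recall differently.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    score = rouge_l_f1("add the chopped onions", "add chopped onions to the pan")
    print(f"ROUGE-L F1: {score:.2f}")
```

In this example the shared subsequence is "add chopped onions" (3 tokens), giving precision 3/4, recall 3/6, and an F1 of 0.60. The values in the table above are presumably the same kind of score expressed on a 0 to 100 scale, but that is an assumption rather than something stated here.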

