HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

7 Oct 2023 · Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne

Instructional videos are a common source for learning text-video and multimodal representations by leveraging subtitles extracted from the videos' audio signal with automatic speech recognition (ASR) systems. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale. Specifically, we prompt an LLM to create plausible video captions based on ASR subtitles of instructional videos. To this end, we introduce a prompting method that takes into account longer blocks of subtitles, allowing us to capture contextual information beyond a single sentence. We further prompt the LLM to generate a timestamp for each produced caption based on the timestamps of the subtitles, and finally align the generated captions to the video temporally. In this way, we obtain human-style video captions at scale without human supervision. We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. Our evaluation shows that the resulting captions not only significantly improve performance on many benchmark datasets for zero-shot text-video retrieval and video captioning, but also disentangle textual narration from the audio, boosting performance on text-video-audio tasks.
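To make the caption-generation step concrete, the sketch below shows one way such prompting could be set up. It is a minimal illustration rather than the authors' released pipeline: the prompt wording, the "start-end: caption" output format, and the `llm_generate` placeholder (any text-completion call, API or local model) are assumptions.

```python
# Illustrative sketch of LLM-based caption generation from ASR subtitles.
# NOTE: not the authors' released code; prompt wording, output format, and
# the `llm_generate` placeholder are assumptions made for this example.
import re
from typing import Callable, List, Tuple

Subtitle = Tuple[float, float, str]  # (start_sec, end_sec, ASR text)
Caption = Tuple[float, float, str]   # (start_sec, end_sec, generated caption)

PROMPT_TEMPLATE = (
    "Below are automatic-speech-recognition subtitles of an instructional video,\n"
    "each with start and end times in seconds. Write short factual captions that\n"
    "describe what is visually happening, one per line, in the format\n"
    "'start-end: caption'.\n\n{subtitles}\n\nCaptions:\n"
)

def build_prompt(subtitles: List[Subtitle]) -> str:
    """Pack a longer block of subtitles (not just one sentence) into a single
    prompt so the LLM can use the surrounding context."""
    lines = [f"{s:.0f}-{e:.0f}: {text}" for s, e, text in subtitles]
    return PROMPT_TEMPLATE.format(subtitles="\n".join(lines))

def parse_captions(llm_output: str) -> List[Caption]:
    """Parse 'start-end: caption' lines produced by the LLM."""
    captions = []
    for line in llm_output.splitlines():
        m = re.match(r"\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*:\s*(.+)", line)
        if m:
            captions.append((float(m.group(1)), float(m.group(2)), m.group(3).strip()))
    return captions

def caption_block(subtitles: List[Subtitle],
                  llm_generate: Callable[[str], str]) -> List[Caption]:
    """Prompt the LLM on a block of subtitles and return timestamped captions."""
    return parse_captions(llm_generate(build_prompt(subtitles)))
```

The resulting timestamped captions would then still need to be aligned to the video temporally, as described in the abstract, before being used as training pairs.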

Results (Task / Dataset / Model; metric: value, global rank)

Zero-Shot Video Retrieval / LSMDC / HowToCaption
    text-to-video R@1: 17.3 (#9)
    text-to-video R@5: 31.7 (#10)
    text-to-video R@10: 38.6 (#11)
    text-to-video Median Rank: 29 (#4)

Zero-Shot Video Retrieval / LSMDC / VAST, HowToCaption-finetuned
    text-to-video R@1: 27.7 (#3)
    text-to-video R@5: 46.5 (#3)
    text-to-video R@10: 54.6 (#3)
    text-to-video Median Rank: 7 (#1)

Zero-Shot Video-Audio Retrieval / MSR-VTT / HowToCaption
    text-to-video+audio R@1: 13.2 (#1)
    text-to-video+audio R@5: 30.3 (#1)
    text-to-video+audio R@10: 41.5 (#1)
    text-to-video+audio Median Rank: 17 (#1)

Zero-Shot Video Retrieval / MSR-VTT / HowToCaption
    text-to-video R@1: 37.6 (#15)
    text-to-video R@5: 62 (#14)
    text-to-video R@10: 73.3 (#12)
    text-to-video Median Rank: 3 (#4)

Zero-Shot Video Retrieval / MSR-VTT / VAST, HowToCaption-finetuned
    text-to-video R@1: 50 (#4)
    text-to-video R@5: 73.2 (#3)
    text-to-video R@10: 81.4 (#4)
    text-to-video Median Rank: 1 (#1)

Video Captioning / MSR-VTT / HowToCaption
    CIDEr: 65.3 (#10)
    METEOR: 32.2 (#6)
    ROUGE-L: 66.3 (#6)
    BLEU-4: 49.8 (#8)

Video Captioning / MSVD / HowToCaption
    CIDEr: 154.2 (#6)
    BLEU-4: 70.4 (#6)
    METEOR: 46.4 (#4)
    ROUGE-L: 83.2 (#4)

Zero-Shot Video Retrieval / MSVD / HowToCaption
    text-to-video R@1: 44.5 (#8)
    text-to-video R@5: 73.3 (#10)
    text-to-video R@10: 82.1 (#10)
    text-to-video Median Rank: 2 (#4)

Zero-Shot Video Retrieval / MSVD / VAST, HowToCaption-finetuned
    text-to-video R@1: 54.8 (#3)
    text-to-video R@5: 80.9 (#4)
    text-to-video R@10: 87.2 (#5)
    text-to-video Median Rank: 1 (#1)

Video Captioning / YouCook2 / HowToCaption
    BLEU-4: 8.8 (#10)
    METEOR: 15.9 (#7)
    ROUGE-L: 37.3 (#8)
    CIDEr: 116.4 (#1)

Zero-Shot Video Retrieval / YouCook2 / HowToCaption
    text-to-video R@1: 13.4 (#8)
    text-to-video R@5: 33.1 (#8)
    text-to-video R@10: 44.1 (#9)
    text-to-video Median Rank: 15 (#2)

Zero-Shot Video Retrieval / YouCook2 / VAST, HowToCaption-finetuned
    text-to-video R@1: 19.7 (#6)
    text-to-video R@5: 43.6 (#4)
    text-to-video R@10: 53.9 (#5)
    text-to-video Median Rank: 8 (#1)

Zero-Shot Video-Audio Retrieval / YouCook2 / HowToCaption
    text-to-video+audio R@1: 25.5 (#1)
    text-to-video+audio R@5: 51.1 (#1)
    text-to-video+audio R@10: 63.6 (#1)
    text-to-video+audio Median Rank: 5 (#1)
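For reference, the R@K and Median Rank numbers above are standard text-to-video retrieval metrics. Below is a minimal sketch of how they are typically computed, assuming a similarity matrix whose diagonal holds the matching text-video pairs; this is generic evaluation code, not the benchmarks' official scripts.

```python
# Generic sketch of retrieval metrics (R@K in percent, Median Rank),
# assuming sim[i, j] scores text query i against video j and the
# ground-truth video for query i is video i (the diagonal).
import numpy as np

def retrieval_metrics(sim: np.ndarray, ks=(1, 5, 10)):
    # Rank of the correct video for each query: number of videos scored
    # strictly higher than the ground truth, plus one.
    correct = np.diag(sim)
    ranks = (sim > correct[:, None]).sum(axis=1) + 1
    metrics = {f"R@{k}": float((ranks <= k).mean() * 100) for k in ks}
    metrics["Median Rank"] = float(np.median(ranks))
    return metrics

# Example with random similarity scores:
rng = np.random.default_rng(0)
print(retrieval_metrics(rng.standard_normal((100, 100))))
```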
