VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

We present VideoCLIP, a contrastive approach to pre-training a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives obtained from nearest-neighbor retrieval. Our experiments on a diverse set of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches. Code is made available at
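The contrastive objective described above can be sketched as a symmetric InfoNCE loss over a batch of paired video and text embeddings. This is a minimal illustration, not the paper's implementation: function names and the temperature value are assumptions, and in VideoCLIP the batches themselves are constructed via nearest-neighbor retrieval so that the in-batch negatives are hard.

```python
import numpy as np

def videoclip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    Row i of each matrix is assumed to come from a temporally overlapping
    video-text pair; every other row in the batch serves as a negative.
    `temperature` here is an illustrative value, not one from the paper.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature       # (B, B) similarity matrix
    idx = np.arange(len(v))              # positives lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()        # -log p(positive)

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

As a sanity check, the loss should be lower when video and text embeddings are aligned (identical rows) than when the text embeddings are random.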

PDF Abstract (EMNLP 2021)

Results from the Paper

 Ranked #1 on Temporal Action Localization on CrossTask (using extra training data)

Task                         | Dataset     | Model                 | Metric            | Value | Global Rank | Uses Extra Training Data
Temporal Action Localization | CrossTask   | VideoCLIP             | Recall            | 47.3  | #1          | Yes
Video Retrieval              | MSR-VTT-1kA | VideoCLIP             | text-to-video R@1 | 30.9  | #20         |
Video Retrieval              | MSR-VTT-1kA | VideoCLIP             | text-to-video R@5 | 55.4  | #22         |
Video Retrieval              | MSR-VTT-1kA | VideoCLIP             | text-to-video R@10| 66.8  | #23         |
Video Retrieval              | YouCook2    | VideoCLIP (zero-shot) | text-to-video R@1 | 22.7  | #6          |
Video Retrieval              | YouCook2    | VideoCLIP (zero-shot) | text-to-video R@5 | 50.4  | #6          |
Video Retrieval              | YouCook2    | VideoCLIP (zero-shot) | text-to-video R@10| 63.1  | #6          |
Video Retrieval              | YouCook2    | VideoCLIP             | text-to-video R@1 | 32.2  | #1          |
Video Retrieval              | YouCook2    | VideoCLIP             | text-to-video R@5 | 62.6  | #2          |
Video Retrieval              | YouCook2    | VideoCLIP             | text-to-video R@10| 75.0  | #1          |

