VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

We present VideoCLIP, a contrastive approach to pre-training a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives obtained from nearest-neighbor retrieval. Our experiments on a diverse set of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches. Code is made available at
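The contrastive objective described above can be sketched as a symmetric InfoNCE loss over a batch of paired video and text embeddings. This is a minimal illustration, not the paper's implementation: function names and the temperature value are assumptions, and in VideoCLIP the batches themselves are constructed via nearest-neighbor retrieval so that the in-batch negatives are hard.

```python
import numpy as np

def videoclip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    Row i of each matrix is assumed to come from a temporally overlapping
    video-text pair; every other row in the batch serves as a negative.
    `temperature` here is an illustrative value, not one from the paper.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature       # (B, B) similarity matrix
    idx = np.arange(len(v))              # positives lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()        # -log p(positive)

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

As a sanity check, the loss should be lower when video and text embeddings are aligned (identical rows) than when the text embeddings are random.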

PDF Abstract (EMNLP 2021)

Results from the Paper

 Ranked #1 on Temporal Action Localization on CrossTask (using extra training data)

Task                         | Dataset     | Model                 | Metric            | Value | Global Rank | Uses Extra Training Data
Temporal Action Localization | CrossTask   | VideoCLIP             | Recall            | 47.3  | #1          | Yes
Video Retrieval              | MSR-VTT-1kA | VideoCLIP             | text-to-video R@1 | 30.9  | #20         |
Video Retrieval              | MSR-VTT-1kA | VideoCLIP             | text-to-video R@5 | 55.4  | #22         |
Video Retrieval              | MSR-VTT-1kA | VideoCLIP             | text-to-video R@10| 66.8  | #23         |
Video Retrieval              | YouCook2    | VideoCLIP (zero-shot) | text-to-video R@1 | 22.7  | #6          |
Video Retrieval              | YouCook2    | VideoCLIP (zero-shot) | text-to-video R@5 | 50.4  | #6          |
Video Retrieval              | YouCook2    | VideoCLIP (zero-shot) | text-to-video R@10| 63.1  | #6          |
Video Retrieval              | YouCook2    | VideoCLIP             | text-to-video R@1 | 32.2  | #1          |
Video Retrieval              | YouCook2    | VideoCLIP             | text-to-video R@5 | 62.6  | #2          |
Video Retrieval              | YouCook2    | VideoCLIP             | text-to-video R@10| 75.0  | #1          |

