VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest-neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches. Code is available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.

EMNLP 2021 · PDF · Abstract
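
The core training signal described in the abstract is a symmetric contrastive objective over paired video and text clips. The sketch below is a minimal, illustrative PyTorch version of such an InfoNCE-style loss, not the fairseq/MMPT implementation; the function name, the temperature value, and the assumption that row i of each batch is a positive pair are ours, and the paper's overlapping-clip sampling and retrieval-augmented hard negatives are omitted.

```python
# Minimal sketch (not the fairseq/MMPT code): a symmetric InfoNCE-style
# video-text contrastive loss over a batch of positive pairs.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) embeddings of temporally
    overlapping video-text pairs; row i of each forms a positive pair."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: video-to-text and text-to-video directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```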

Results from the Paper


 Ranked #1 on Temporal Action Localization on CrossTask (using extra training data)

Task                         | Dataset     | Model                 | Metric             | Value | Global Rank
Temporal Action Localization | CrossTask   | VideoCLIP             | Recall             | 47.3  | #1
Video Retrieval              | MSR-VTT-1kA | VideoCLIP             | text-to-video R@1  | 30.9  | #20
Video Retrieval              | MSR-VTT-1kA | VideoCLIP             | text-to-video R@5  | 55.4  | #22
Video Retrieval              | MSR-VTT-1kA | VideoCLIP             | text-to-video R@10 | 66.8  | #23
Video Retrieval              | YouCook2    | VideoCLIP (zero-shot) | text-to-video R@1  | 22.7  | #6
Video Retrieval              | YouCook2    | VideoCLIP (zero-shot) | text-to-video R@5  | 50.4  | #6
Video Retrieval              | YouCook2    | VideoCLIP (zero-shot) | text-to-video R@10 | 63.1  | #6
Video Retrieval              | YouCook2    | VideoCLIP             | text-to-video R@1  | 32.2  | #1
Video Retrieval              | YouCook2    | VideoCLIP             | text-to-video R@5  | 62.6  | #2
Video Retrieval              | YouCook2    | VideoCLIP             | text-to-video R@10 | 75.0  | #1
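
For reference, the text-to-video R@K numbers above measure the fraction of text queries whose ground-truth video appears among the top K retrieved results. Below is a minimal sketch of that metric, assuming an (N, N) similarity matrix in which query i matches video i; the function name and setup are illustrative, not taken from the paper's evaluation code.

```python
# Minimal sketch of text-to-video Recall@K from a similarity matrix.
import torch

def recall_at_k(sim, k):
    """sim: (N, N) text-to-video similarities; ground truth for query i is video i."""
    topk = sim.topk(k, dim=1).indices                 # (N, k) retrieved video indices
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # ground-truth index per query
    hits = (topk == targets).any(dim=1).float()
    return hits.mean().item() * 100.0                 # percentage, e.g. R@1, R@5, R@10
```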

Methods


No methods listed for this paper.