VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
EMNLP 2021
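The abstract describes training by contrasting positive video-text pairs against negatives. As a rough illustration of that idea only, the sketch below computes a generic symmetric InfoNCE-style loss over a batch in which `video_embs[i]` and `text_embs[i]` are assumed to be a positive pair and every other item in the batch acts as a negative. The function names, the `temperature` value, and the use of in-batch negatives are assumptions made for this example; it does not reproduce the overlapping-clip sampling or the retrieval-based hard negatives the paper describes, and it is not taken from the released code.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Generic contrastive loss; temperature=0.07 is an assumed default."""
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    n = len(v)
    # sim[i][j]: scaled similarity between video i and text j.
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Video-to-text direction: each row should pick out its own column.
    v2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-video direction: each column should pick out its own row.
    t2v = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return v2t + t2v


# Toy usage: matched pairs give a much lower loss than swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = symmetric_info_nce(videos, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = symmetric_info_nce(videos, [[0.0, 1.0], [1.0, 0.0]])
```

Summing both directions keeps the objective symmetric between the two modalities; replacing the in-batch negatives with retrieved hard negatives would change only how the batch is assembled, not this loss computation.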
Results from the Paper
Ranked #1 on Temporal Action Localization on CrossTask (using extra training data)
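Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Action Segmentation | COIN | VideoCLIP | Frame accuracy | 68.7 | #4 |
Temporal Action Localization | CrossTask | VideoCLIP | Recall | 47.3 | #1 |
Zero-Shot Video Retrieval | DiDeMo | VideoCLIP | text-to-video R@1 | 16.6 | #25 |
Zero-Shot Video Retrieval | DiDeMo | VideoCLIP | text-to-video R@5 | 46.9 | #22 |
Zero-Shot Video Retrieval | MSR-VTT | VideoCLIP | text-to-video R@1 | 10.4 | #31 |
Zero-Shot Video Retrieval | MSR-VTT | VideoCLIP | text-to-video R@5 | 22.2 | #32 |
Zero-Shot Video Retrieval | MSR-VTT | VideoCLIP | text-to-video R@10 | 30.0 | #31 |
Video Retrieval | MSR-VTT-1kA | VideoCLIP | text-to-video R@1 | 30.9 | #45 |
Video Retrieval | MSR-VTT-1kA | VideoCLIP | text-to-video R@5 | 55.4 | #46 |
Video Retrieval | MSR-VTT-1kA | VideoCLIP | text-to-video R@10 | 66.8 | #50 |
Zero-Shot Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@1 | 22.7 | #2 |
Zero-Shot Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@5 | 50.4 | #2 |
Zero-Shot Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@10 | 63.1 | #2 |
Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | Cap. Avg. R@1 | 74.5 | #2 |
Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | Cap. Avg. R@5 | 94.5 | #3 |
Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | Cap. Avg. R@10 | 97.9 | #1 |
Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | DTW R@1 | 56.0 | #3 |
Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | DTW R@5 | 96.3 | #3 |
Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | DTW R@10 | 89.9 | #3 |
Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | OTAM R@1 | 52.8 | #3 |
Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | OTAM R@5 | 95.0 | #3 |
Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | OTAM R@10 | 89.2 | #3 |
Video Retrieval | YouCook2 | VideoCLIP (zero-shot) | text-to-video R@1 | 22.7 | #8 |
Video Retrieval | YouCook2 | VideoCLIP (zero-shot) | text-to-video R@10 | 63.1 | #10 |
Video Retrieval | YouCook2 | VideoCLIP (zero-shot) | text-to-video R@5 | 50.4 | #8 |
Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@1 | 32.2 | #3 |
Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@10 | 75.0 | #2 |
Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@5 | 62.6 | #4 |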
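
Most retrieval rows in the table report text-to-video R@K, the percentage of text queries whose correct video appears among the top K ranked results. The sketch below shows one common way to compute that metric from a query-by-video similarity matrix. It assumes, for illustration, that query `i` has exactly one correct video at index `i`; the function name `recall_at_k` and the toy scores are hypothetical and are not taken from the paper or its code.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth video is ranked in the top k.

    sim[i][j] is the similarity of text query i to video j; the ground-truth
    video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank video indices by descending similarity to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy similarity matrix: 3 text queries scored against 3 videos.
toy_sim = [
    [0.9, 0.1, 0.0],  # query 0 ranks its own video first
    [0.8, 0.2, 0.1],  # query 1 ranks its own video second
    [0.1, 0.3, 0.2],  # query 2 ranks its own video second
]
r_at_1 = recall_at_k(toy_sim, 1)
r_at_2 = recall_at_k(toy_sim, 2)
```

Under this definition R@K can only stay the same or increase as K grows for a fixed set of queries and candidate videos.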