VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
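
The contrastive objective described in the abstract can be illustrated with a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired video and text embeddings. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `temperature` parameter, and the use of plain Python lists are hypothetical. It treats every other item in the batch as a negative. The hard negatives from nearest neighbor retrieval mentioned in the abstract would then come down to which clips are placed in the same batch, which this sketch does not implement.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax_at(scores, target):
    """Log-probability of scores[target] under a softmax over scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_z


def symmetric_info_nce(video_emb, text_emb, temperature=1.0):
    """Average of the video-to-text and text-to-video InfoNCE losses.

    video_emb[i] and text_emb[i] are assumed to form a positive pair;
    all other items in the batch act as negatives.
    """
    n = len(video_emb)
    sim = [[dot(v, t) / temperature for t in text_emb] for v in video_emb]
    # Each video should pick out its own text among all texts in the batch.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Each text should pick out its own video among all videos in the batch.
    t2v = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


# Toy usage: well-aligned pairs give a lower loss than mismatched pairs.
videos = [[5.0, 0.0], [0.0, 5.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_info_nce(videos, texts)
shuffled_loss = symmetric_info_nce(videos, list(reversed(texts)))
```

In this toy setup the loss is small when each video embedding points in the same direction as its paired text embedding and grows when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward.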

EMNLP 2021

Results from the Paper


 Ranked #1 on Temporal Action Localization on CrossTask (using extra training data)

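| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Action Segmentation | COIN | VideoCLIP | Frame accuracy | 68.7 | #4 |
| Temporal Action Localization | CrossTask | VideoCLIP | Recall | 47.3 | #1 |
| Zero-Shot Video Retrieval | DiDeMo | VideoCLIP | text-to-video R@1 | 16.6 | #22 |
| Zero-Shot Video Retrieval | DiDeMo | VideoCLIP | text-to-video R@5 | 46.9 | #19 |
| Zero-Shot Video Retrieval | MSR-VTT | VideoCLIP | text-to-video R@1 | 10.4 | #28 |
| Zero-Shot Video Retrieval | MSR-VTT | VideoCLIP | text-to-video R@5 | 22.2 | #29 |
| Zero-Shot Video Retrieval | MSR-VTT | VideoCLIP | text-to-video R@10 | 30.0 | #28 |
| Video Retrieval | MSR-VTT-1kA | VideoCLIP | text-to-video R@1 | 30.9 | #45 |
| Video Retrieval | MSR-VTT-1kA | VideoCLIP | text-to-video R@5 | 55.4 | #46 |
| Video Retrieval | MSR-VTT-1kA | VideoCLIP | text-to-video R@10 | 66.8 | #50 |
| Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@1 | 32.2 | #3 |
| Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@5 | 62.6 | #4 |
| Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@10 | 75.0 | #2 |
| Zero-Shot Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@1 | 22.7 | #2 |
| Zero-Shot Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@5 | 50.4 | #2 |
| Zero-Shot Video Retrieval | YouCook2 | VideoCLIP | text-to-video R@10 | 63.1 | #2 |
| Video Retrieval | YouCook2 | VideoCLIP (zero-shot) | text-to-video R@1 | 22.7 | #8 |
| Video Retrieval | YouCook2 | VideoCLIP (zero-shot) | text-to-video R@5 | 50.4 | #8 |
| Video Retrieval | YouCook2 | VideoCLIP (zero-shot) | text-to-video R@10 | 63.1 | #10 |
| Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | Cap. Avg. R@1 | 74.5 | #2 |
| Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | Cap. Avg. R@5 | 94.5 | #3 |
| Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | Cap. Avg. R@10 | 97.9 | #1 |
| Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | DTW R@1 | 56.0 | #3 |
| Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | DTW R@5 | 89.9 | #3 |
| Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | DTW R@10 | 96.3 | #3 |
| Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | OTAM R@1 | 52.8 | #3 |
| Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | OTAM R@5 | 89.2 | #3 |
| Long Video Retrieval (Background Removed) | YouCook2 | VideoCLIP | OTAM R@10 | 95.0 | #3 |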
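
The R@k entries above are recall-at-k scores: the percentage of text queries for which the correct video appears among the k highest-ranked candidates. A minimal sketch of that computation follows. The function name `recall_at_k`, the similarity-matrix layout, and the convention that query i matches video i are assumptions made for illustration, not the benchmark's official evaluation code.

```python
def recall_at_k(sim, k):
    """Percentage of text queries whose matching video ranks in the top k.

    sim[i][j] is the similarity of text query i to video j. The correct
    video for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank: how many videos score strictly higher than the match.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy usage: three text queries scored against three videos.
sim = [
    [0.9, 0.1, 0.0],  # query 0: correct video ranked first
    [0.8, 0.2, 0.1],  # query 1: correct video ranked second
    [0.1, 0.5, 0.3],  # query 2: correct video ranked second
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```

Because the top-k set only grows as k increases, recall at a larger k can never be lower than recall at a smaller k for the same model and dataset.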
