CLIP4Clip: An Empirical Study of CLIP for End-to-End Video Clip Retrieval

18 Apr 2021  ·  Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li ·

Video-text retrieval plays an essential role in multi-modal research and is widely used in many real-world web applications. CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. Several questions are investigated via empirical studies: 1) Are image features enough for video-text retrieval? 2) How does post-pretraining on a large-scale video-text dataset affect the performance of CLIP? 3) What is a practical mechanism for modeling temporal dependency between video frames? 4) How sensitive is the model to hyper-parameters on the video-text retrieval task? Extensive experimental results show that the CLIP4Clip model transferred from CLIP achieves SOTA results on various video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo. We release our code at https://github.com/ArrowLuo/CLIP4Clip.
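The simplest retrieval mechanism the paper studies can be sketched as follows: encode each sampled frame with the CLIP image encoder, aggregate the frame embeddings by parameter-free mean pooling, and score the video against the CLIP text embedding with cosine similarity. The NumPy arrays below stand in for real CLIP encoder outputs, and the function and variable names (`video_text_similarity`, `frame_embs`, `text_emb`) are illustrative placeholders, not the authors' API.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length, as CLIP does before the dot product."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def video_text_similarity(frame_embs, text_emb):
    """Cosine similarity between one video and one text query.

    frame_embs: (num_frames, dim) array of per-frame CLIP image embeddings.
    text_emb:   (dim,) CLIP text embedding.

    Mean pooling over frames is the parameter-free temporal aggregation;
    the sequence-based variants the paper also studies (e.g. LSTM or
    Transformer over frames) would replace the mean below.
    """
    video_emb = l2_normalize(frame_embs.mean(axis=0))  # frames -> one video vector
    return float(video_emb @ l2_normalize(text_emb))   # cosine similarity in [-1, 1]

def rank_videos(all_frame_embs, text_emb):
    """Text-to-video retrieval: rank candidate videos, best match first."""
    sims = [video_text_similarity(f, text_emb) for f in all_frame_embs]
    return np.argsort(sims)[::-1]
```

Because mean pooling adds no parameters, this variant can directly reuse CLIP's pre-trained weights; the temporal-modeling question in the abstract asks whether learned aggregation over frames does better.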


Results from the Paper


All rows report the CLIP4Clip model.

| Task | Dataset | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- |
| Video Retrieval | ActivityNet | text-to-video R@1 | 40.5 | #17 |
| Video Retrieval | ActivityNet | text-to-video R@5 | 73.4 | #13 |
| Video Retrieval | ActivityNet | text-to-video R@50 | 98.2 | #1 |
| Video Retrieval | ActivityNet | text-to-video Median Rank | 2 | #4 |
| Video Retrieval | ActivityNet | text-to-video Mean Rank | 7.5 | #10 |
| Video Retrieval | DiDeMo | text-to-video R@1 | 43.4 | #25 |
| Video Retrieval | DiDeMo | text-to-video R@5 | 70.2 | #24 |
| Video Retrieval | DiDeMo | text-to-video R@10 | 80.6 | #22 |
| Video Retrieval | DiDeMo | text-to-video Median Rank | 2.0 | #7 |
| Video Retrieval | DiDeMo | text-to-video Mean Rank | 17.5 | #10 |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 15.1 | #6 |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 28.5 | #7 |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 36.4 | #7 |
| Zero-Shot Video Retrieval | LSMDC | text-to-video Median Rank | 28 | #2 |
| Zero-Shot Video Retrieval | LSMDC | text-to-video Mean Rank | 117 | #1 |
| Video Retrieval | LSMDC | text-to-video R@1 | 21.6 | #21 |
| Video Retrieval | LSMDC | text-to-video R@5 | 41.8 | #18 |
| Video Retrieval | LSMDC | text-to-video R@10 | 49.8 | #18 |
| Video Retrieval | LSMDC | text-to-video Mean Rank | 58.0 | #9 |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 32.0 | #12 |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 57.0 | #10 |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 66.9 | #9 |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 4 | #2 |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Mean Rank | 34.0 | #2 |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 71.4 | #7 |
| Text to Video Retrieval | MSR-VTT | text-to-video R@1 | 44.5 | #1 |
| Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 15.3 | #15 |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 81.6 | #27 |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 2 | #7 |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 42.7 | #18 |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 70.9 | #17 |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 80.6 | #17 |
| Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 2 | #5 |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 38.5 | #4 |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 66.9 | #3 |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 76.8 | #3 |
| Zero-Shot Video Retrieval | MSVD | text-to-video Median Rank | 2 | #1 |
| Zero-Shot Video Retrieval | MSVD | text-to-video Mean Rank | 17.8 | #1 |
| Video Retrieval | MSVD | text-to-video R@1 | 46.2 | #15 |
| Video Retrieval | MSVD | text-to-video R@5 | 76.1 | #13 |
| Video Retrieval | MSVD | text-to-video R@10 | 84.6 | #12 |
| Video Retrieval | MSVD | text-to-video Median Rank | 2 | #7 |
| Video Retrieval | MSVD | text-to-video Mean Rank | 10.0 | #9 |
| Video Retrieval | MSVD | video-to-text R@1 | 62.0 | #10 |
| Video Retrieval | MSVD | video-to-text R@5 | 87.3 | #9 |
| Video Retrieval | MSVD | video-to-text R@10 | 92.6 | #8 |
| Video Retrieval | MSVD | video-to-text Median Rank | 1 | #1 |
