CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

14 Sep 2022  ·  Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo

Pre-trained image-text models such as CLIP have demonstrated the strong power of vision-language representations learned from large-scale web-collected image-text data. In light of these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have great impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We will release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.
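The abstract describes a Video Proxy mechanism that lets CLIP's vision encoder aggregate information across frames with only a few extra tokens. Below is a minimal, hypothetical sketch of that idea: a handful of learnable proxy tokens are processed jointly with the concatenated per-frame patch tokens, and one proxy serves as the video-level embedding. The module name, hyperparameters, and the use of a separate Transformer encoder are illustrative assumptions; the paper integrates the proxies inside the CLIP vision transformer itself, so consult the released code for the actual implementation.

```python
# Hypothetical sketch of a "video proxy" idea on top of CLIP-style frame features.
# Not the official CLIP-ViP module; names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class VideoProxyEncoder(nn.Module):
    def __init__(self, embed_dim=512, num_proxies=4, num_heads=8, num_layers=2):
        super().__init__()
        # Learnable proxy tokens shared across all videos.
        self.proxies = nn.Parameter(torch.randn(num_proxies, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_patch_tokens):
        # frame_patch_tokens: (batch, num_frames * num_patches, embed_dim),
        # e.g. per-frame patch embeddings from a CLIP ViT.
        b = frame_patch_tokens.size(0)
        proxies = self.proxies.unsqueeze(0).expand(b, -1, -1)
        # Proxies attend jointly with all frame patch tokens across time.
        tokens = torch.cat([proxies, frame_patch_tokens], dim=1)
        tokens = self.encoder(tokens)
        # Use the first proxy token as the video-level representation.
        return tokens[:, 0]
```

The appeal of this design is that the temporal aggregation cost scales with the small number of proxy tokens rather than requiring a full video transformer, which keeps the model close to the original CLIP architecture.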


Results from the Paper


Ranked #2 on Video Retrieval on MSR-VTT-1kA (using extra training data)

Task: Video Retrieval (text-to-video) · Model: CLIP-ViP · Uses extra training data: Yes

| Dataset     | Metric                    | Value | Global Rank |
|-------------|---------------------------|-------|-------------|
| ActivityNet | text-to-video R@1         | 61.4  | #8          |
| ActivityNet | text-to-video R@5         | 85.7  | #5          |
| ActivityNet | text-to-video R@10        | 92.6  | #6          |
| ActivityNet | text-to-video Median Rank | 1     | #1          |
| DiDeMo      | text-to-video R@1         | 55.3  | #14         |
| DiDeMo      | text-to-video R@5         | 82    | #7          |
| DiDeMo      | text-to-video R@10        | 89.3  | #8          |
| DiDeMo      | text-to-video Median Rank | 1     | #1          |
| LSMDC       | text-to-video R@1         | 30.7  | #9          |
| LSMDC       | text-to-video R@5         | 51.4  | #6          |
| LSMDC       | text-to-video R@10        | 60.6  | #6          |
| LSMDC       | text-to-video Median Rank | 5     | #2          |
| MSR-VTT-1kA | text-to-video R@1         | 57.7  | #2          |
| MSR-VTT-1kA | text-to-video R@5         | 80.5  | #2          |
| MSR-VTT-1kA | text-to-video R@10        | 88.2  | #3          |
| MSR-VTT-1kA | text-to-video Median Rank | 1.0   | #1          |
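For reference, R@K and Median Rank above are the standard text-to-video retrieval metrics: R@K is the percentage of text queries whose ground-truth video appears in the top-K ranked results, and Median Rank is the median position of the ground-truth video. Below is a minimal sketch of how these metrics are typically computed from a text-video similarity matrix; it is not the paper's evaluation code, and the function and variable names are illustrative.

```python
# Sketch: compute text-to-video R@K and Median Rank from a similarity matrix
# where the i-th text query matches the i-th video. Illustrative only.
import numpy as np

def retrieval_metrics(sim):
    # sim: (num_texts, num_videos), higher = more similar.
    order = np.argsort(-sim, axis=1)            # videos ranked per text query
    gt = np.arange(sim.shape[0])[:, None]       # ground-truth video index per query
    ranks = np.argmax(order == gt, axis=1) + 1  # 1-based rank of the correct video
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MdR": float(np.median(ranks)),
    }

# Example on a random 1000x1000 matrix (the size of the MSR-VTT-1kA test split).
print(retrieval_metrics(np.random.randn(1000, 1000)))
```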
