TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	MSR-VTT	CLIP2TV	text-to-video R@1	33.1	# 20
Video Retrieval	MSR-VTT	CLIP2TV	text-to-video R@5	58.9	# 18
Video Retrieval	MSR-VTT	CLIP2TV	text-to-video R@10	68.9	# 18
Video Retrieval	MSR-VTT	CLIP2TV	text-to-video Mean Rank	44.7	# 3
Video Retrieval	MSR-VTT	CLIP2TV	text-to-video Median Rank	3	# 1
Video Retrieval	MSR-VTT-1kA	CLIP2TV	text-to-video Mean Rank	12.8	# 13
Video Retrieval	MSR-VTT-1kA	CLIP2TV	text-to-video R@1	52.9	# 12
Video Retrieval	MSR-VTT-1kA	CLIP2TV	text-to-video R@5	78.5	# 8
Video Retrieval	MSR-VTT-1kA	CLIP2TV	text-to-video R@10	86.5	# 9
Video Retrieval	MSR-VTT-1kA	CLIP2TV	text-to-video Median Rank	1	# 1
Video Retrieval	MSR-VTT-1kA	CLIP2TV	video-to-text R@1	54.1	# 6
Video Retrieval	MSR-VTT-1kA	CLIP2TV	video-to-text R@5	77.4	# 6
Video Retrieval	MSR-VTT-1kA	CLIP2TV	video-to-text R@10	85.7	# 7
Video Retrieval	MSR-VTT-1kA	CLIP2TV	video-to-text Median Rank	1	# 1
Video Retrieval	MSR-VTT-1kA	CLIP2TV	video-to-text Mean Rank	9.0	# 13


CLIP2TV: Align, Match and Distill for Video-Text Retrieval

10 Nov 2021 · Zijian Gao, Jingyu Liu, Weiqi Sun, Sheng Chen, Dedan Chang, Lili Zhao

Modern video-text retrieval frameworks consist of three parts: a video encoder, a text encoder, and a similarity head. Following their success in both visual and textual representation learning, transformer-based encoders and fusion methods have also been adopted in video-text retrieval. In this report, we present CLIP2TV, which explores where the critical elements lie in transformer-based methods. To this end, we first revisit recent work on multi-modal learning, then introduce several techniques into video-text retrieval, and finally evaluate them through extensive experiments in different configurations. Notably, CLIP2TV achieves 52.9 R@1 on the MSR-VTT dataset, outperforming the previous SOTA result by 4.1%.
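To make the three-part structure concrete, here is a minimal sketch of one of the simplest possible similarity heads: mean-pool per-frame embeddings into a video embedding, then score it against a text embedding with cosine similarity. This is an illustrative assumption, not the head CLIP2TV uses; the function names (`mean_pool`, `cosine_similarity`, `similarity_matrix`) and the toy embeddings are hypothetical, and the encoders themselves are replaced by hand-written vectors.

```python
import math


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into a single video embedding."""
    n = len(frame_embeddings)
    return [sum(dim) / n for dim in zip(*frame_embeddings)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(text_embeddings, video_frame_embeddings):
    """Return sim[i][j] = score between text i and video j."""
    videos = [mean_pool(frames) for frames in video_frame_embeddings]
    return [[cosine_similarity(t, v) for v in videos] for t in text_embeddings]


# Toy stand-ins for encoder outputs (made-up 2-D embeddings).
texts = [[1.0, 0.0], [0.0, 1.0]]
videos = [
    [[0.9, 0.1], [0.7, 0.3]],  # two frames, pooled to roughly [0.8, 0.2]
    [[0.2, 0.8], [0.0, 1.0]],  # two frames, pooled to roughly [0.1, 0.9]
]
sim = similarity_matrix(texts, videos)
```

In this toy setup each text scores highest against the video with the same index, which is the behaviour a retrieval model is trained to produce on real data.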


Tasks


Representation Learning

Retrieval

Text Retrieval

Video Retrieval

Video-Text Retrieval

Datasets

MSR-VTT

MSVD

DiDeMo

VATEX

Results from the Paper


Ranked #12 on Video Retrieval on MSR-VTT-1kA (using extra training data)
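The metrics reported for this paper (R@1, R@5, R@10, median rank, mean rank) are standard retrieval measures. As a minimal sketch of how they are typically computed, assume a square similarity matrix in which the correct candidate for query `i` is candidate `i`; the function name `retrieval_metrics` and the toy scores are hypothetical and are not taken from the paper's evaluation code.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, median rank and mean rank from a square similarity matrix.

    sim[i][j] is the score between query i and candidate j; the correct
    candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of candidates scoring strictly higher than the match
        # (ties are resolved in favour of the correct candidate).
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)
    n = len(ranks)
    # R@K: percentage of queries whose correct candidate is in the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["median_rank"] = median(ranks)
    metrics["mean_rank"] = mean(ranks)
    return metrics


# Toy example: 4 text queries against 4 videos (made-up scores).
toy_sim = [
    [0.9, 0.1, 0.2, 0.3],  # correct video ranked 1st
    [0.8, 0.7, 0.1, 0.2],  # correct video ranked 2nd
    [0.1, 0.2, 0.6, 0.3],  # correct video ranked 1st
    [0.5, 0.6, 0.7, 0.4],  # correct video ranked 4th
]
toy_metrics = retrieval_metrics(toy_sim, ks=(1, 2))
print(toy_metrics)
```

Higher R@K is better, while lower median and mean rank are better. Transposing the matrix before calling the function gives the video-to-text direction.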

