CenterCLIP: Token Clustering for Efficient Text-Video Retrieval

2 May 2022 · Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang

Recently, large-scale pre-training methods like CLIP have made great progress in multi-modal research such as text-video retrieval. In CLIP, transformers are vital for modeling complex multi-modal relations. However, in the vision transformer of CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens because consecutive, similar video frames are highly redundant. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm that finds the most representative tokens and drops the non-essential ones. Since frame redundancy occurs mostly across consecutive frames, we divide videos into multiple segments and conduct segment-level clustering. Center tokens from each segment are then concatenated into a new sequence, while their original spatial-temporal relations are well preserved. We instantiate two clustering algorithms to efficiently find deterministic medoids and to iteratively partition groups in high-dimensional space. Through this token clustering and center selection procedure, we reduce computation costs by removing redundant visual tokens. The method further enhances segment-level semantic alignment between video and text representations by enforcing spatio-temporal interactions among tokens from within-segment frames. Our method, coined CenterCLIP, surpasses the existing state of the art by a large margin on typical text-video benchmarks, while reducing the training memory cost by 35% and accelerating the inference speed by 14% in the best case. The code is available at https://github.com/mzhaoshuai/CenterCLIP.
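
For intuition, below is a minimal sketch (in PyTorch, not the authors' released code) of the multi-segment token clustering idea: patch tokens from consecutive frames are grouped by temporal segment, a simple k-medoids step picks the most representative tokens inside each segment, and the selected centers are concatenated back in their original order. The function name `cluster_video_tokens` and the `num_segments`, `centers_per_segment`, and `iters` parameters are illustrative assumptions, not values from the paper.

```python
import torch

def cluster_video_tokens(tokens: torch.Tensor,
                         num_segments: int = 4,
                         centers_per_segment: int = 49,
                         iters: int = 5) -> torch.Tensor:
    """tokens: (T, N, D) patch tokens for T frames with N patches of dim D.
    Returns (num_segments * centers_per_segment, D) selected center tokens."""
    T, N, D = tokens.shape
    frames_per_seg = T // num_segments
    centers = []
    for s in range(num_segments):
        # flatten every patch token that falls inside this temporal segment
        seg = tokens[s * frames_per_seg:(s + 1) * frames_per_seg].reshape(-1, D)
        # deterministic init: evenly spaced tokens serve as the first medoids
        idx = torch.linspace(0, seg.size(0) - 1, centers_per_segment).long()
        medoids = seg[idx]
        for _ in range(iters):
            # assign each token to its nearest medoid (Euclidean distance)
            assign = torch.cdist(seg, medoids).argmin(dim=1)
            for k in range(centers_per_segment):
                members = seg[assign == k]
                if members.numel() == 0:
                    continue  # keep the old medoid if its cluster emptied
                # new medoid: the member token closest to the cluster mean
                mean = members.mean(dim=0, keepdim=True)
                medoids[k] = members[torch.cdist(members, mean).argmin()]
        centers.append(medoids)
    # concatenating segment-level centers keeps the original temporal order
    return torch.cat(centers, dim=0)


# usage sketch: 12 frames, 196 patches per frame (ViT-B/16 at 224x224), dim 512
video_tokens = torch.randn(12, 196, 512)
centers = cluster_video_tokens(video_tokens)  # shape (4 * 49, 512)
```

Because self-attention cost grows quadratically with sequence length, feeding only the selected center tokens to the transformer instead of all patch tokens is what yields the memory and speed savings reported above.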


Results from the Paper


Ranked #11 on Video Retrieval on MSVD (using extra training data)

Video Retrieval on ActivityNet, CenterCLIP (ViT-B/16)

| Metric | Value | Global Rank |
|--------|-------|-------------|
| text-to-video R@1 | 46.2 | #19 |
| text-to-video R@5 | 77.0 | #14 |
| text-to-video R@10 | 87.6 | #11 |
| text-to-video Median Rank | 2 | #5 |
| text-to-video Mean Rank | 5.7 | #5 |
| video-to-text R@1 | 46.7 | #8 |
| video-to-text R@5 | 77.1 | #5 |
| video-to-text R@10 | 88.0 | #4 |
| video-to-text Median Rank | 2 | #2 |
| video-to-text Mean Rank | 5.5 | #4 |

Video Retrieval on LSMDC, CenterCLIP (ViT-B/16)

| Metric | Value | Global Rank |
|--------|-------|-------------|
| text-to-video R@1 | 24.2 | #20 |
| text-to-video R@5 | 46.2 | #12 |
| text-to-video R@10 | 55.9 | #9 |
| text-to-video Median Rank | 8 | #6 |
| text-to-video Mean Rank | 47.3 | #4 |
| video-to-text R@1 | 24.5 | #9 |
| video-to-text R@5 | 46.4 | #5 |
| video-to-text R@10 | 55.8 | #4 |
| video-to-text Median Rank | 7 | #2 |
| video-to-text Mean Rank | 41.3 | #5 |

Video Retrieval on MSR-VTT-1kA, CenterCLIP (ViT-B/16)

| Metric | Value | Global Rank |
|--------|-------|-------------|
| text-to-video R@1 | 48.4 | #25 |
| text-to-video R@5 | 73.8 | #23 |
| text-to-video R@10 | 82.0 | #30 |
| text-to-video Median Rank | 2 | #10 |
| text-to-video Mean Rank | 13.8 | #15 |
| video-to-text R@1 | 47.7 | #13 |
| video-to-text R@5 | 75.0 | #10 |
| video-to-text R@10 | 83.3 | #17 |
| video-to-text Median Rank | 2 | #7 |
| video-to-text Mean Rank | 10.2 | #17 |

Video Retrieval on MSVD, CenterCLIP (ViT-B/16)

| Metric | Value | Global Rank |
|--------|-------|-------------|
| text-to-video R@1 | 50.6 | #11 |
| text-to-video R@5 | 80.3 | #10 |
| text-to-video R@10 | 88.4 | #7 |
| text-to-video Median Rank | 1 | #1 |
| text-to-video Mean Rank | 8.4 | #4 |
| video-to-text R@1 | 68.4 | #9 |
| video-to-text R@5 | 90.1 | #7 |
| video-to-text R@10 | 95.0 | #5 |
| video-to-text Median Rank | 1 | #1 |
| video-to-text Mean Rank | 3.0 | #3 |
