TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	LSMDC	X-Pool	text-to-video R@1	25.2	# 17
Video Retrieval	LSMDC	X-Pool	text-to-video R@5	43.7	# 15
Video Retrieval	LSMDC	X-Pool	text-to-video R@10	53.5	# 17
Video Retrieval	LSMDC	X-Pool	text-to-video Median Rank	8.0	# 6
Video Retrieval	LSMDC	X-Pool	video-to-text R@1	22.7	# 11
Video Retrieval	LSMDC	X-Pool	video-to-text R@5	42.6	# 8
Video Retrieval	LSMDC	X-Pool	video-to-text R@10	51.2	# 8
Video Retrieval	LSMDC	X-Pool	video-to-text Median Rank	10.0	# 5
Video Retrieval	LSMDC	X-Pool	text-to-video Mean Rank	53.2	# 6
Video Retrieval	LSMDC	X-Pool	video-to-text Mean Rank	47.4	# 6
Video Retrieval	MSR-VTT-1kA	X-Pool	text-to-video Mean Rank	14.3	# 17
Video Retrieval	MSR-VTT-1kA	X-Pool	text-to-video R@1	46.9	# 28
Video Retrieval	MSR-VTT-1kA	X-Pool	text-to-video R@5	72.8	# 27
Video Retrieval	MSR-VTT-1kA	X-Pool	text-to-video R@10	82.2	# 29
Video Retrieval	MSR-VTT-1kA	X-Pool	text-to-video Median Rank	2.0	# 10
Video Retrieval	MSR-VTT-1kA	X-Pool	video-to-text R@1	44.4	# 19
Video Retrieval	MSR-VTT-1kA	X-Pool	video-to-text R@5	73.3	# 17
Video Retrieval	MSR-VTT-1kA	X-Pool	video-to-text R@10	84.0	# 13
Video Retrieval	MSR-VTT-1kA	X-Pool	video-to-text Median Rank	2.0	# 7
Video Retrieval	MSR-VTT-1kA	X-Pool	video-to-text Mean Rank	9.0	# 13
Video Retrieval	MSVD	X-Pool	text-to-video R@1	47.2	# 17
Video Retrieval	MSVD	X-Pool	text-to-video R@5	77.4	# 13
Video Retrieval	MSVD	X-Pool	text-to-video R@10	86.0	# 12
Video Retrieval	MSVD	X-Pool	text-to-video Median Rank	2.0	# 8
Video Retrieval	MSVD	X-Pool	text-to-video Mean Rank	9.3	# 9
Video Retrieval	MSVD	X-Pool	video-to-text R@1	66.4	# 11
Video Retrieval	MSVD	X-Pool	video-to-text R@5	90.0	# 8
Video Retrieval	MSVD	X-Pool	video-to-text R@10	94.2	# 8
Video Retrieval	MSVD	X-Pool	video-to-text Median Rank	1.0	# 1
Video Retrieval	MSVD	X-Pool	video-to-text Mean Rank	3.3	# 5
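
The R@K, Median Rank and Mean Rank values in the table above are conventionally derived from a query-by-candidate similarity matrix. The sketch below shows one common way to compute them. It assumes exactly one correct candidate per query, placed on the diagonal; the function name `retrieval_metrics` and the toy matrix are illustrative and are not taken from the X-Pool code base. Benchmarks with several captions per video may need a different ground-truth mapping.

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute retrieval metrics from a similarity matrix.

    sim[i, j] is the similarity of query i to candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    n = sim.shape[0]
    # Candidates sorted from most to least similar for every query.
    order = np.argsort(-sim, axis=1)
    # 1-based position of the correct candidate in each sorted row.
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1
    return {
        "R@1": float(100.0 * np.mean(ranks <= 1)),
        "R@5": float(100.0 * np.mean(ranks <= 5)),
        "R@10": float(100.0 * np.mean(ranks <= 10)),
        "MedianR": float(np.median(ranks)),
        "MeanR": float(np.mean(ranks)),
    }

# Toy text-to-video matrix: 4 texts (rows) scored against 4 videos (columns).
# The correct videos land at ranks [1, 2, 1, 4].
sim = np.array([
    [0.90, 0.10, 0.20, 0.30],
    [0.80, 0.50, 0.10, 0.20],
    [0.10, 0.20, 0.70, 0.30],
    [0.60, 0.45, 0.40, 0.30],
])
metrics = retrieval_metrics(sim)
print(metrics)
```

Higher is better for R@K, lower is better for the two rank metrics. Under the same one-to-one assumption, passing the transposed matrix gives the video-to-text direction.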

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/x-pool-cross-modal-language-video-attention/video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/video-retrieval-on-lsmdc?p=x-pool-cross-modal-language-video-attention)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/x-pool-cross-modal-language-video-attention/video-retrieval-on-msvd)](https://paperswithcode.com/sota/video-retrieval-on-msvd?p=x-pool-cross-modal-language-video-attention)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/x-pool-cross-modal-language-video-attention/video-retrieval-on-msr-vtt-1ka)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka?p=x-pool-cross-modal-language-video-attention)`

X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

CVPR 2022 · Satya Krishna Gorti, Noel Vouitsis, Junwei Ma, Keyvan Golestan, Maksims Volkovs, Animesh Garg, Guangwei Yu

In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However, videos inherently express a much wider gamut of information than texts: a text often captures only a sub-region of an entire video and is most semantically similar to certain frames within it. Therefore, for a given text, a retrieval model should focus on the text's most semantically similar video sub-regions to make a more relevant comparison. Yet most existing works aggregate entire videos without directly considering the text. Common text-agnostic aggregation schemes include mean-pooling or self-attention over the frames, but these are likely to encode misleading visual information not described in the given text. To address this, we propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video. Our core mechanism is a scaled dot-product attention for a text to attend to its most semantically similar frames. We then generate an aggregated video representation conditioned on the text's attention weights over the frames. We evaluate our method on three benchmark datasets, MSR-VTT, MSVD and LSMDC, achieving new state-of-the-art results with up to 12% relative improvement in Recall@1. Our findings thereby highlight the importance of joint text-video reasoning to extract important visual cues according to the text. Full code and demo can be found at: https://layer6ai-labs.github.io/xpool/
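
The contrast the abstract draws between text-agnostic mean-pooling and text-conditioned attention can be illustrated with a minimal NumPy sketch. This is a simplification and not the authors' implementation: it omits any learned projections or other trainable layers the full model may use, and the function names, embedding size and synthetic data are all assumptions made for illustration. The text embedding acts as the query and the frame embeddings act as both keys and values.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def text_conditioned_pool(text_emb, frame_embs):
    """Pool frames with scaled dot-product attention driven by the text.

    text_emb:   (d,)   text embedding, used as the query
    frame_embs: (F, d) frame embeddings, used as keys and values
    Returns the pooled video embedding (d,) and the attention weights (F,).
    """
    d = text_emb.shape[-1]
    scores = frame_embs @ text_emb / np.sqrt(d)  # one score per frame
    weights = softmax(scores)                    # attention over frames
    pooled = weights @ frame_embs                # weighted sum of frames
    return pooled, weights

def mean_pool(frame_embs):
    """Text-agnostic baseline: every frame gets the same weight."""
    return frame_embs.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic data: 12 random frames, of which only frame 3 matches the text.
rng = np.random.default_rng(0)
text = rng.normal(size=64)
frames = rng.normal(size=(12, 64))
frames[3] = text + 0.1 * rng.normal(size=64)

pooled, weights = text_conditioned_pool(text, frames)
print("most attended frame:", int(np.argmax(weights)))
print("cosine, text-conditioned:", cosine(pooled, text))
print("cosine, mean-pooled:", cosine(mean_pool(frames), text))
```

In this toy setting the attention concentrates on the one matching frame, so the pooled embedding stays close to the text, while mean-pooling dilutes that frame with the eleven unrelated ones.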


Code


layer6ai-labs/xpool (official) · 104 stars

Tasks


Retrieval

Text to Video Retrieval

Video Retrieval

Video-Text Retrieval

Datasets

Visual Question Answering

MSR-VTT

MSVD

LSMDC

Results from the Paper


Ranked #17 on Video Retrieval on LSMDC (using extra training data)


Methods


Cross-Attention Module • Softmax
