TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	MSR-VTT-1kA	SuMA (ViT-B/16)	text-to-video R@1	49.8	# 16
Video Retrieval	MSR-VTT-1kA	SuMA (ViT-B/16)	text-to-video R@5	75.1	# 20
Video Retrieval	MSR-VTT-1kA	SuMA (ViT-B/16)	text-to-video R@10	83.9	# 19
Video Retrieval	MSR-VTT-1kA	SuMA (ViT-B/16)	video-to-text R@1	47.3	# 15
Video Retrieval	MSR-VTT-1kA	SuMA (ViT-B/16)	video-to-text R@5	76	# 8
Video Retrieval	MSR-VTT-1kA	SuMA (ViT-B/16)	video-to-text R@10	84.3	# 11
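
The R@K values in the table are recall at rank K, reported in percent: the share of queries whose ground-truth match appears among the top K retrieved items. The following is a minimal sketch of that computation, assuming a square similarity matrix in which query i is paired with exactly one correct candidate i. The function names `recall_at_k` and `transpose` and the toy scores are illustrative and are not taken from the paper's code.

```python
from typing import Dict, List, Sequence


def recall_at_k(sim: Sequence[Sequence[float]],
                ks: Sequence[int] = (1, 5, 10)) -> Dict[int, float]:
    """Recall@K in percent for a square query-by-candidate similarity matrix.

    Assumption: sim[i][j] scores query i against candidate j, and the
    correct candidate for query i is candidate i (one-to-one pairing).
    """
    n = len(sim)
    ranks: List[int] = []
    for i, row in enumerate(sim):
        # Rank of the ground truth = 1 + number of candidates scored strictly higher.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return {k: 100.0 * sum(r <= k for r in ranks) / n for k in ks}


def transpose(sim: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap query and candidate roles (text-to-video <-> video-to-text)."""
    return [list(col) for col in zip(*sim)]


# Toy 3x3 example with made-up scores: rows are text queries, columns are videos.
toy_sim = [
    [0.9, 0.1, 0.3],  # text 0: correct video 0 is ranked 1st
    [0.2, 0.4, 0.8],  # text 1: correct video 1 is ranked 2nd
    [0.5, 0.6, 0.1],  # text 2: correct video 2 is ranked 3rd
]
t2v = recall_at_k(toy_sim, ks=(1, 2, 3))             # text-to-video
v2t = recall_at_k(transpose(toy_sim), ks=(1, 2, 3))  # video-to-text
```

Transposing the matrix swaps the retrieval direction, which is why the table reports text-to-video and video-to-text recall separately.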

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-text-retrieval-by-supervised-multi/video-retrieval-on-msr-vtt-1ka)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka?p=video-text-retrieval-by-supervised-multi)`

Video-Text Retrieval by Supervised Sparse Multi-Grained Learning

19 Feb 2023 · Yimu Wang, Peng Shi

Recent progress in video-text retrieval has been driven by the exploration of better representation learning. In this paper, we present a novel multi-grained sparse learning framework, S3MA, that learns an aligned sparse space shared between video and text for video-text retrieval. The shared sparse space is initialized with a finite number of sparse concepts, each of which refers to a number of words. Using the text data at hand, we learn and update the shared sparse space in a supervised manner with the proposed similarity and alignment losses. Moreover, to enable multi-grained alignment, we incorporate frame representations to better model the video modality and to compute fine-grained and coarse-grained similarities. Extensive experiments on several video-text retrieval benchmarks demonstrate that S3MA, benefiting from the learned shared sparse space and multi-grained similarities, outperforms existing methods. Our code is available at https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.
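
The abstract does not spell out how the fine-grained and coarse-grained similarities are computed, so the following is only an illustrative sketch of one common pattern, not the S3MA method. It assumes coarse-grained similarity is the cosine between a mean-pooled video vector and the text vector, fine-grained similarity is the best per-frame cosine, and the two are mixed with a hypothetical `weight` parameter. The shared sparse space and the similarity and alignment losses are not modeled here. All names and values are made up for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def coarse_similarity(frames: Sequence[Vector], text: Vector) -> float:
    """Assumed coarse-grained score: mean-pool frames, then compare to the text."""
    dim = len(text)
    video = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return cosine(video, text)


def fine_similarity(frames: Sequence[Vector], text: Vector) -> float:
    """Assumed fine-grained score: the best-matching single frame."""
    return max(cosine(f, text) for f in frames)


def multi_grained_similarity(frames: Sequence[Vector], text: Vector,
                             weight: float = 0.5) -> float:
    """Hypothetical mix of the two scores; `weight` is not from the paper."""
    return (weight * coarse_similarity(frames, text)
            + (1.0 - weight) * fine_similarity(frames, text))


# Toy 2-D embeddings (made up): two frames and one text query.
frames = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
coarse = coarse_similarity(frames, text)
fine = fine_similarity(frames, text)
combined = multi_grained_similarity(frames, text)
```

In this toy case the fine-grained score exceeds the coarse-grained one, because a single frame matches the text exactly while mean pooling dilutes that match with the unrelated frame.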

Code

yimuwangcs/Better_Cross_Modal_Retri… official

Tasks

Representation Learning

Retrieval

Sparse Learning

Text Retrieval

Video Retrieval

Video-Text Retrieval

Datasets

ActivityNet

MSR-VTT

MSVD

Results from the Paper

Ranked #16 on Video Retrieval on MSR-VTT-1kA
