TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	MSR-VTT	UCoFiA	text-to-video R@1	49.4	# 11
Video Retrieval	MSR-VTT	UCoFiA	text-to-video R@5	72.1	# 10
Video Retrieval	MSR-VTT	UCoFiA	text-to-video R@10	83.5	# 8
Video Retrieval	MSR-VTT-1kA	UCoFiA	text-to-video R@1	49.4	# 18
Video Retrieval	MSR-VTT-1kA	UCoFiA	text-to-video R@5	72.1	# 30
Video Retrieval	MSR-VTT-1kA	UCoFiA	text-to-video R@10	83.5	# 21
Video Retrieval	MSR-VTT-1kA	UCoFiA	video-to-text R@1	47.1	# 16
Video Retrieval	MSR-VTT-1kA	UCoFiA	video-to-text R@5	74.3	# 12
Video Retrieval	MSR-VTT-1kA	UCoFiA	video-to-text R@10	83.0	# 19
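
For reference, R@K in the table above is the percentage of queries whose correct item appears among the top K retrieved candidates. The following is a minimal sketch of that computation, not the evaluation script used for the paper. It assumes a square similarity matrix in which query i has exactly one correct candidate, at index i. The function name `recall_at_k` and the toy matrix are hypothetical.

```python
def recall_at_k(sim, k):
    """Return R@K (in percent) for a query-by-candidate similarity matrix.

    Assumption: sim[i][j] scores query i against candidate j, and the
    ground-truth candidate for query i sits at index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct candidate: how many candidates
        # score strictly higher than it for this query.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy text-to-video example: rows are text queries, columns are videos.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.7, 0.5, 0.1],
    [0.1, 0.3, 0.2],
]
for k in (1, 2):
    print(f"R@{k}: {recall_at_k(toy_sim, k):.1f}")
```

Under the same one-to-one assumption, the video-to-text direction can be scored by passing the transposed matrix. Benchmarks with several captions per video need a ground-truth mapping instead of the diagonal.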

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unified-coarse-to-fine-alignment-for-video/video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt?p=unified-coarse-to-fine-alignment-for-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unified-coarse-to-fine-alignment-for-video/video-retrieval-on-msr-vtt-1ka)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka?p=unified-coarse-to-fine-alignment-for-video)`

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

ICCV 2023 · Ziyang Wang, Yi-Lin Sung, Feng Cheng, Gedas Bertasius, Mohit Bansal

The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained alignment between visual and textual information. However, retrieving the correct video according to the text query is often challenging as it requires the ability to reason about both high-level (scene) and low-level (object) visual cues and how they relate to the text query. To this end, we propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA. Specifically, our model captures the cross-modal similarity information at different granularity levels. To alleviate the effect of irrelevant visual cues, we also apply an Interactive Similarity Aggregation module (ISA) to consider the importance of different visual features while aggregating the cross-modal similarity to obtain a similarity score for each granularity. Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them, alleviating over- and under-representation issues at different levels. By jointly considering the cross-modal similarity at different granularities, UCoFiA allows the effective unification of multi-grained alignments. Empirically, UCoFiA outperforms previous state-of-the-art CLIP-based methods on multiple video-text retrieval benchmarks, achieving 2.4%, 1.4% and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT, ActivityNet, and DiDeMo, respectively. Our code is publicly available at https://github.com/Ziyang412/UCoFiA.
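
The normalize-then-sum step described in the abstract can be illustrated with a generic Sinkhorn-Knopp iteration. This is a sketch under stated assumptions, not the authors' implementation (the linked repository is the authoritative source). It assumes each granularity level yields a text-by-video similarity matrix, that the scores are exponentiated to make them positive, and that rows and columns are alternately rescaled to sum to 1. The names `sinkhorn_knopp`, `fuse_levels`, `n_iters`, and `temperature`, and the toy matrices, are hypothetical.

```python
import math


def sinkhorn_knopp(sim, n_iters=50, temperature=1.0):
    """Alternately rescale rows and columns of exp(sim / temperature).

    For a square matrix, the result approaches a doubly stochastic
    matrix (every row and every column sums to 1).
    """
    # Exponentiate so that every entry is strictly positive (an assumption
    # of this sketch; the paper may prepare the scores differently).
    k = [[math.exp(s / temperature) for s in row] for row in sim]
    n_rows, n_cols = len(k), len(k[0])
    for _ in range(n_iters):
        # Row step: make each row sum to 1.
        k = [[v / sum(row) for v in row] for row in k]
        # Column step: make each column sum to 1.
        col_sums = [sum(k[i][j] for i in range(n_rows)) for j in range(n_cols)]
        k = [[k[i][j] / col_sums[j] for j in range(n_cols)] for i in range(n_rows)]
    return k


def fuse_levels(levels, n_iters=50):
    """Normalize each level's similarity matrix, then sum them elementwise."""
    normalized = [sinkhorn_knopp(level, n_iters) for level in levels]
    n_rows, n_cols = len(levels[0]), len(levels[0][0])
    return [
        [sum(m[i][j] for m in normalized) for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Two toy levels on different scales, e.g. a coarse and a fine similarity.
coarse = [[0.9, 0.1, 0.2], [0.2, 0.8, 0.1], [0.1, 0.3, 0.7]]
fine = [[1.8, 0.2, 0.4], [0.4, 1.6, 0.2], [0.2, 0.6, 1.4]]
for row in fuse_levels([coarse, fine]):
    print([round(v, 3) for v in row])
```

Because every level is brought to the same row and column totals before summation, no single level can dominate the fused score simply by having larger raw values, which is the over- and under-representation concern the abstract mentions.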

Code

ziyang412/ucofia official

Tasks

Retrieval

Text Retrieval

Text to Video Retrieval

Video Retrieval

Video-Text Retrieval

Datasets

MSR-VTT

MSVD

DiDeMo

Results from the Paper

Ranked #11 on Video Retrieval on MSR-VTT
