TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Retrieval	DiDeMo	LaT	text-to-video R@1	22.6	# 22
Zero-Shot Video Retrieval	DiDeMo	LaT	text-to-video R@5	45.9	# 25
Zero-Shot Video Retrieval	DiDeMo	LaT	text-to-video R@10	58.9	# 21
Zero-Shot Video Retrieval	DiDeMo	LaT	video-to-text R@1	22.5	# 8
Zero-Shot Video Retrieval	DiDeMo	LaT	text-to-video Median Rank	7	# 8
Zero-Shot Video Retrieval	DiDeMo	LaT	video-to-text R@5	45.2	# 8
Zero-Shot Video Retrieval	DiDeMo	LaT	video-to-text R@10	56.8	# 8
Zero-Shot Video Retrieval	DiDeMo	LaT	video-to-text Median Rank	7	# 1
Zero-Shot Video Retrieval	MSR-VTT	LaT	text-to-video R@1	23.4	# 26
Zero-Shot Video Retrieval	MSR-VTT	LaT	text-to-video R@5	44.1	# 26
Zero-Shot Video Retrieval	MSR-VTT	LaT	text-to-video R@10	53.3	# 26
Zero-Shot Video Retrieval	MSR-VTT	LaT	video-to-text R@1	17.2	# 8
Zero-Shot Video Retrieval	MSR-VTT	LaT	text-to-video Median Rank	8	# 8
Zero-Shot Video Retrieval	MSR-VTT	LaT	video-to-text R@5	36.2	# 7
Zero-Shot Video Retrieval	MSR-VTT	LaT	video-to-text R@10	47.9	# 7
Zero-Shot Video Retrieval	MSR-VTT	LaT	video-to-text Median Rank	12	# 3
Zero-Shot Video Retrieval	MSVD	LaT	text-to-video R@1	36.9	# 11
Zero-Shot Video Retrieval	MSVD	LaT	video-to-text R@1	34.4	# 8
Zero-Shot Video Retrieval	MSVD	LaT	text-to-video R@5	68.6	# 9
Zero-Shot Video Retrieval	MSVD	LaT	text-to-video R@10	81.0	# 9
Zero-Shot Video Retrieval	MSVD	LaT	video-to-text R@5	69.0	# 7
Zero-Shot Video Retrieval	MSVD	LaT	video-to-text R@10	79.2	# 7
Zero-Shot Video Retrieval	MSVD	LaT	text-to-video Median Rank	2	# 3
Zero-Shot Video Retrieval	MSVD	LaT	video-to-text Median Rank	3	# 3
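
The R@K and Median Rank columns above can be illustrated with a small sketch. It assumes a square similarity matrix in which query `i` has exactly one correct candidate, also at index `i`; this is a simplification, since datasets such as MSVD pair several captions with each video. The function name `retrieval_metrics` and the toy scores are hypothetical and are not taken from the paper.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and median rank from a similarity matrix.

    Assumption: sim[i][j] scores query i against candidate j, and the
    correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the true match: 1 + number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["MedianRank"] = median(ranks)
    return metrics


# Toy text-to-video scores: rows are text queries, columns are videos.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.4],
]
text_to_video = retrieval_metrics(sim)
# Video-to-text uses the transposed matrix (rows become video queries).
video_to_text = retrieval_metrics([list(col) for col in zip(*sim)])
```

Higher is better for R@K, while lower is better for Median Rank, which is why a median rank of 2 or 3 in the table is a strong result.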

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/lat-latent-translation-with-cycle-consistency/zero-shot-video-retrieval-on-msvd)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msvd?p=lat-latent-translation-with-cycle-consistency)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/lat-latent-translation-with-cycle-consistency/zero-shot-video-retrieval-on-didemo)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-didemo?p=lat-latent-translation-with-cycle-consistency)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/lat-latent-translation-with-cycle-consistency/zero-shot-video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msr-vtt?p=lat-latent-translation-with-cycle-consistency)`

LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval

11 Jul 2022 · Jinbin Bai, Chunhui Liu, Feiyue Ni, Haofan Wang, Mengying Hu, Xiaofeng Guo, Lele Cheng

Video-text retrieval is a class of cross-modal representation learning problems in which, given a text query and a pool of candidate videos, the goal is to select the video that corresponds to the query. The contrastive paradigm of vision-language pretraining has shown promising success with large-scale datasets and a unified transformer architecture, and has demonstrated the power of a joint latent space. Despite this, the intrinsic divergence between the visual and textual domains is still far from eliminated, and projecting different modalities into a joint latent space may distort the information within each individual modality. To overcome this issue, we present a novel mechanism for learning the translation relationship from a source modality space $\mathcal{S}$ to a target modality space $\mathcal{T}$ without the need for a joint latent space, which bridges the gap between the visual and textual domains. Furthermore, to keep cycle consistency between translations, we adopt a cycle loss involving both forward translations from $\mathcal{S}$ to the predicted target space $\mathcal{T'}$ and backward translations from $\mathcal{T'}$ back to $\mathcal{S}$. Extensive experiments on the MSR-VTT, MSVD, and DiDeMo datasets demonstrate the superiority and effectiveness of our LaT approach compared with vanilla state-of-the-art methods.
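
The cycle described in the abstract (forward from $\mathcal{S}$ to $\mathcal{T'}$, then backward to $\mathcal{S}$) can be sketched as a reconstruction penalty. The abstract does not specify the translator architecture or the distance used, so this sketch assumes plain linear maps and a squared L2 error as stand-ins; `matvec`, `cycle_loss`, and the weight matrices are hypothetical names, not the authors' implementation.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cycle_loss(S, forward_W, backward_W):
    """Mean squared reconstruction error after the round trip S -> T' -> S.

    Assumption: both translators are linear maps; a real model would
    likely use learned nonlinear networks.
    """
    total = 0.0
    for s in S:
        t_pred = matvec(forward_W, s)        # forward translation S -> T'
        s_rec = matvec(backward_W, t_pred)   # backward translation T' -> S
        total += sum((a - b) ** 2 for a, b in zip(s, s_rec))
    return total / len(S)


source_batch = [[1.0, 2.0]]
forward_W = [[2.0, 0.0], [0.0, 2.0]]

# A backward map that exactly inverts the forward map gives zero cycle loss.
inverse_W = [[0.5, 0.0], [0.0, 0.5]]
consistent = cycle_loss(source_batch, forward_W, inverse_W)

# A backward map that does not invert the forward map is penalized.
identity_W = [[1.0, 0.0], [0.0, 1.0]]
inconsistent = cycle_loss(source_batch, forward_W, identity_W)
```

In this toy setting the loss is zero only when the backward map undoes the forward map, which is the property a cycle-consistency term is meant to encourage.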


Code


No code implementations yet.

Tasks


Representation Learning

Retrieval

Text Retrieval

Translation

Video Retrieval

Video-Text Retrieval

Zero-Shot Video Retrieval

Datasets

MSR-VTT

Conceptual Captions

MSVD

HowTo100M

DiDeMo

WebVid

Results from the Paper


Ranked #11 on Zero-Shot Video Retrieval on MSVD


Methods


No methods listed for this paper.
