LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval

11 Jul 2022  ·  Jinbin Bai, Chunhui Liu, Feiyue Ni, Haofan Wang, Mengying Hu, Xiaofeng Guo, Lele Cheng

Video-text retrieval is a class of cross-modal representation learning problems in which the goal is to select, from a pool of candidate videos, the video that corresponds to a given text query. The contrastive paradigm of vision-language pretraining has shown promising success with large-scale datasets and unified transformer architectures, and has demonstrated the power of a joint latent space. Despite this, the intrinsic divergence between the visual and textual domains is still far from eliminated, and projecting different modalities into a joint latent space may distort the information within each modality. To overcome this issue, we present a novel mechanism for learning the translation relationship from a source modality space $\mathcal{S}$ to a target modality space $\mathcal{T}$ without the need for a joint latent space, which bridges the gap between the visual and textual domains. Furthermore, to keep cycle consistency between translations, we adopt a cycle loss involving both forward translations from $\mathcal{S}$ to the predicted target space $\mathcal{T'}$ and backward translations from $\mathcal{T'}$ back to $\mathcal{S}$. Extensive experiments on the MSR-VTT, MSVD, and DiDeMo datasets demonstrate the superiority and effectiveness of our LaT approach compared with vanilla state-of-the-art methods.
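The forward and backward translations described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the translator architecture or the exact form of the losses, so everything here is an assumption made for illustration: the translators are plain linear maps (`W_fwd`, `W_bwd`), both losses are mean squared errors, and the function names are hypothetical rather than taken from the paper.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def cycle_losses(W_fwd, W_bwd, s, t):
    """Return (translation_loss, cycle_loss) for one (source, target) pair.

    W_fwd maps S -> T' (forward translation) and W_bwd maps T' -> S
    (backward translation). Linear maps and MSE are assumptions of this
    sketch, not details confirmed by the abstract.
    """
    t_pred = matvec(W_fwd, s)      # forward: s in S -> predicted t' in T'
    s_rec = matvec(W_bwd, t_pred)  # backward: t' -> reconstructed source
    return mse(t_pred, t), mse(s_rec, s)

# Toy pair: the forward map doubles each coordinate.
W_fwd = [[2.0, 0.0], [0.0, 2.0]]
s = [1.0, -3.0]
t = [2.0, -6.0]

# A backward map that halves each coordinate closes the cycle exactly.
W_bwd_good = [[0.5, 0.0], [0.0, 0.5]]
trans_loss, cyc_loss = cycle_losses(W_fwd, W_bwd_good, s, t)

# An identity backward map leaves the round trip off by a factor of 2.
W_bwd_bad = [[1.0, 0.0], [0.0, 1.0]]
_, cyc_loss_bad = cycle_losses(W_fwd, W_bwd_bad, s, t)
```

In this toy case the translation loss is zero for both backward maps, and only the cycle term separates them, which is the role a cycle loss plays alongside the forward translation objective.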

Results from the Paper


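Task                       Dataset  Model  Metric Name                Metric Value  Global Rank
Zero-Shot Video Retrieval  DiDeMo   LaT    text-to-video R@1          22.6          # 22
Zero-Shot Video Retrieval  DiDeMo   LaT    text-to-video R@5          45.9          # 25
Zero-Shot Video Retrieval  DiDeMo   LaT    text-to-video R@10         58.9          # 21
Zero-Shot Video Retrieval  DiDeMo   LaT    video-to-text R@1          22.5          # 8
Zero-Shot Video Retrieval  DiDeMo   LaT    text-to-video Median Rank  7             # 8
Zero-Shot Video Retrieval  DiDeMo   LaT    video-to-text R@5          45.2          # 8
Zero-Shot Video Retrieval  DiDeMo   LaT    video-to-text R@10         56.8          # 8
Zero-Shot Video Retrieval  DiDeMo   LaT    video-to-text Median Rank  7             # 1
Zero-Shot Video Retrieval  MSR-VTT  LaT    text-to-video R@1          23.4          # 26
Zero-Shot Video Retrieval  MSR-VTT  LaT    text-to-video R@5          44.1          # 26
Zero-Shot Video Retrieval  MSR-VTT  LaT    text-to-video R@10         53.3          # 26
Zero-Shot Video Retrieval  MSR-VTT  LaT    video-to-text R@1          17.2          # 8
Zero-Shot Video Retrieval  MSR-VTT  LaT    text-to-video Median Rank  8             # 8
Zero-Shot Video Retrieval  MSR-VTT  LaT    video-to-text R@5          36.2          # 7
Zero-Shot Video Retrieval  MSR-VTT  LaT    video-to-text R@10         47.9          # 7
Zero-Shot Video Retrieval  MSR-VTT  LaT    video-to-text Median Rank  12            # 3
Zero-Shot Video Retrieval  MSVD     LaT    text-to-video R@1          36.9          # 11
Zero-Shot Video Retrieval  MSVD     LaT    video-to-text R@1          34.4          # 8
Zero-Shot Video Retrieval  MSVD     LaT    text-to-video R@5          68.6          # 9
Zero-Shot Video Retrieval  MSVD     LaT    text-to-video R@10         81.0          # 9
Zero-Shot Video Retrieval  MSVD     LaT    video-to-text R@5          69.0          # 7
Zero-Shot Video Retrieval  MSVD     LaT    video-to-text R@10         79.2          # 7
Zero-Shot Video Retrieval  MSVD     LaT    text-to-video Median Rank  2             # 3
Zero-Shot Video Retrieval  MSVD     LaT    video-to-text Median Rank  3             # 3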
