Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications. While there have been a number of recent successes in developing effective image-text retrieval methods by learning joint representations, the video-text retrieval task, in contrast, has not been explored to its fullest extent. In this paper, we study how to effectively utilize the multimodal cues available in videos for the cross-modal video-text retrieval task. Based on our analysis, we propose a novel framework that simultaneously exploits multimodal features (different visual characteristics, audio inputs, and text) via a fusion strategy for efficient retrieval. Furthermore, we explore several loss functions for training the joint embedding and propose a modified pairwise ranking loss for the retrieval task. Experiments on the MSVD and MSR-VTT datasets demonstrate that our method achieves significant performance gains compared to state-of-the-art approaches.
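The abstract does not spell out the exact form of the modified pairwise ranking loss, so the following is only a minimal PyTorch sketch of a bidirectional max-margin ranking loss of the kind commonly used for joint video-text embeddings; the hard-negative emphasis is an assumption based on standard practice (e.g., VSE++-style losses), not necessarily the paper's exact modification, and the function name and `margin` default are illustrative.

```python
import torch

def pairwise_ranking_loss(video_emb, text_emb, margin=0.2, hard_negatives=True):
    """Bidirectional max-margin ranking loss over a batch of L2-normalized
    video/text embeddings of shape (batch_size, dim).
    NOTE: a generic sketch, not the authors' exact formulation."""
    scores = video_emb @ text_emb.t()                    # cosine similarity matrix
    diag = scores.diag().view(-1, 1)                     # matched-pair scores
    cost_t = (margin + scores - diag).clamp(min=0)       # video -> wrong text
    cost_v = (margin + scores - diag.t()).clamp(min=0)   # text -> wrong video
    eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_t = cost_t.masked_fill(eye, 0)                  # ignore the positive pair
    cost_v = cost_v.masked_fill(eye, 0)
    if hard_negatives:                                   # penalize only the hardest negative
        return cost_t.max(dim=1)[0].sum() + cost_v.max(dim=0)[0].sum()
    return cost_t.sum() + cost_v.sum()                   # sum over all negatives
```

In practice, either variant is applied to the fused multimodal video embedding and the sentence embedding after projecting both into the shared space.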
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Benchmark |
|---|---|---|---|---|---|---|
| Video Retrieval | MSR-VTT | JEMC | text-to-video R@1 | 7.0 | # 37 | |
| Video Retrieval | MSR-VTT | JEMC | text-to-video R@5 | 20.9 | # 32 | |
| Video Retrieval | MSR-VTT | JEMC | text-to-video R@10 | 29.7 | # 33 | |
| Video Retrieval | MSR-VTT | JEMC | text-to-video Mean Rank | 213.8 | # 7 | |
| Video Retrieval | MSR-VTT | JEMC | text-to-video Median Rank | 29.7 | # 17 | |
| Video Retrieval | MSR-VTT | JEMC | video-to-text R@1 | 12.5 | # 12 | |
| Video Retrieval | MSR-VTT | JEMC | video-to-text R@5 | 32.1 | # 11 | |
| Video Retrieval | MSR-VTT | JEMC | video-to-text R@10 | 42.2 | # 9 | |
| Video Retrieval | MSR-VTT | JEMC | video-to-text Median Rank | 16 | # 6 | |
| Video Retrieval | MSR-VTT | JEMC | video-to-text Mean Rank | 134 | # 4 | |
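For reference, the R@K (recall at K, higher is better) and median/mean rank (lower is better) figures above are standard retrieval metrics computed from a query-item similarity matrix. The sketch below shows one conventional way to compute them; it assumes query i's ground-truth item sits at index i, and `retrieval_metrics` is an illustrative name, not the authors' evaluation script.

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1/5/10, median rank, and mean rank from a similarity
    matrix sim[i, j] = score(query_i, item_j), where item i is the
    ground-truth match for query i."""
    order = np.argsort(-sim, axis=1)                      # best match first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1   # 1-indexed rank of truth
                      for i in range(sim.shape[0])])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in (1, 5, 10)}
    metrics["Median Rank"] = float(np.median(ranks))
    metrics["Mean Rank"] = float(np.mean(ranks))
    return metrics
```

Text-to-video numbers use captions as queries against video embeddings; video-to-text numbers transpose the similarity matrix and query in the opposite direction.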