Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications. While there have been a number of recent successes in developing effective image-text retrieval methods by learning joint representations, the video-text retrieval task, in contrast, has not been explored to its fullest extent. In this paper, we study how to effectively utilize the multimodal cues available in videos for the cross-modal video-text retrieval task. Based on our analysis, we propose a novel framework that simultaneously exploits multimodal features (different visual characteristics, audio inputs, and text) via a fusion strategy for efficient retrieval. Furthermore, we explore several loss functions for training the joint embedding and propose a modified pairwise ranking loss for the retrieval task. Experiments on the MSVD and MSR-VTT datasets demonstrate that our method achieves significant performance gains compared to state-of-the-art approaches.
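The abstract does not spell out the exact form of the modified pairwise ranking loss, so the following is only a minimal PyTorch sketch of a bidirectional max-margin ranking loss of the kind commonly used for joint video-text embeddings; the hard-negative emphasis is an assumption based on standard practice (e.g., VSE++-style losses), not necessarily the paper's exact modification, and the function name and `margin` default are illustrative.

```python
import torch

def pairwise_ranking_loss(video_emb, text_emb, margin=0.2, hard_negatives=True):
    """Bidirectional max-margin ranking loss over a batch of L2-normalized
    video/text embeddings of shape (batch_size, dim).
    NOTE: a generic sketch, not the authors' exact formulation."""
    scores = video_emb @ text_emb.t()                    # cosine similarity matrix
    diag = scores.diag().view(-1, 1)                     # matched-pair scores
    cost_t = (margin + scores - diag).clamp(min=0)       # video -> wrong text
    cost_v = (margin + scores - diag.t()).clamp(min=0)   # text -> wrong video
    eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_t = cost_t.masked_fill(eye, 0)                  # ignore the positive pair
    cost_v = cost_v.masked_fill(eye, 0)
    if hard_negatives:                                   # penalize only the hardest negative
        return cost_t.max(dim=1)[0].sum() + cost_v.max(dim=0)[0].sum()
    return cost_t.sum() + cost_v.sum()                   # sum over all negatives
```

In practice, either variant is applied to the fused multimodal video embedding and the sentence embedding after projecting both into the shared space.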
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Benchmark |
|---|---|---|---|---|---|---|
| Video Retrieval | MSR-VTT | JEMC | text-to-video R@1 | 7.0 | # 37 | |
| Video Retrieval | MSR-VTT | JEMC | text-to-video R@5 | 20.9 | # 32 | |
| Video Retrieval | MSR-VTT | JEMC | text-to-video R@10 | 29.7 | # 33 | |
| Video Retrieval | MSR-VTT | JEMC | text-to-video Mean Rank | 213.8 | # 7 | |
| Video Retrieval | MSR-VTT | JEMC | text-to-video Median Rank | 29.7 | # 17 | |
| Video Retrieval | MSR-VTT | JEMC | video-to-text R@1 | 12.5 | # 12 | |
| Video Retrieval | MSR-VTT | JEMC | video-to-text R@5 | 32.1 | # 11 | |
| Video Retrieval | MSR-VTT | JEMC | video-to-text R@10 | 42.2 | # 9 | |
| Video Retrieval | MSR-VTT | JEMC | video-to-text Median Rank | 16 | # 6 | |
| Video Retrieval | MSR-VTT | JEMC | video-to-text Mean Rank | 134 | # 4 | |
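For reference, the R@K (recall at K, higher is better) and median/mean rank (lower is better) figures above are standard retrieval metrics computed from a query-item similarity matrix. The sketch below shows one conventional way to compute them; it assumes query i's ground-truth item sits at index i, and `retrieval_metrics` is an illustrative name, not the authors' evaluation script.

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1/5/10, median rank, and mean rank from a similarity
    matrix sim[i, j] = score(query_i, item_j), where item i is the
    ground-truth match for query i."""
    order = np.argsort(-sim, axis=1)                      # best match first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1   # 1-indexed rank of truth
                      for i in range(sim.shape[0])])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in (1, 5, 10)}
    metrics["Median Rank"] = float(np.median(ranks))
    metrics["Mean Rank"] = float(np.mean(ranks))
    return metrics
```

Text-to-video numbers use captions as queries against video embeddings; video-to-text numbers transpose the similarity matrix and query in the opposite direction.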