RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval

26 Jun 2022 · Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

Vast numbers of videos are uploaded daily as social channels grow in popularity; retrieving the video content most relevant to a user's textual query has therefore become increasingly important. Most methods consider only a single joint embedding space between global visual and textual features, without considering the local structure of each modality. Other approaches build multiple embedding spaces from global and local features separately, ignoring rich inter-modality correlations. We propose RoME, a novel mixture-of-expert transformer that disentangles text and video into three levels: the roles of spatial contexts, temporal contexts, and object contexts. We utilize a transformer-based attention mechanism to fully exploit visual and textual embeddings at both the global and local levels, with a mixture of experts capturing correlations across modalities and structures. The results indicate that our method outperforms state-of-the-art methods on the YouCook2 and MSR-VTT datasets, given the same visual backbone without pre-training. Finally, we conduct extensive ablation studies to elucidate our design choices.
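To make the idea concrete, below is a minimal PyTorch sketch of a role-aware mixture-of-experts retrieval model in the spirit of the abstract: one small transformer expert per role (spatial, temporal, object) embeds each modality, and a query-conditioned gate mixes the per-role text-video similarities into a single retrieval score. All names (RoleExpert, RoleAwareMoE), dimensions, the mean pooling, and the gating scheme are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a role-aware mixture-of-experts retrieval model.
# Module names, dimensions, and gating are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoleExpert(nn.Module):
    """One expert per role (spatial / temporal / object contexts).

    A small transformer encoder refines the token-level features of its
    modality, followed by mean pooling into a single embedding.
    """

    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> pooled embedding (batch, dim)
        return self.encoder(tokens).mean(dim=1)


class RoleAwareMoE(nn.Module):
    """Embeds video and text with per-role experts, then gates the
    per-role similarities into one text-to-video retrieval score."""

    def __init__(self, dim: int = 256, num_roles: int = 3):
        super().__init__()
        self.video_experts = nn.ModuleList(RoleExpert(dim) for _ in range(num_roles))
        self.text_experts = nn.ModuleList(RoleExpert(dim) for _ in range(num_roles))
        # Gating network: mixture weights over roles from the pooled query.
        self.gate = nn.Linear(dim, num_roles)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Stack L2-normalized per-role embeddings: (batch, roles, dim).
        v = torch.stack(
            [F.normalize(e(video_tokens), dim=-1) for e in self.video_experts], dim=1)
        t = torch.stack(
            [F.normalize(e(text_tokens), dim=-1) for e in self.text_experts], dim=1)
        # Query-conditioned mixture weights over the roles: (batch, roles).
        weights = self.gate(text_tokens.mean(dim=1)).softmax(dim=-1)
        # Per-role cosine similarity between every text and every video,
        # mixed by the expert weights: (num_texts, num_videos).
        sims = torch.einsum("trd,vrd->tvr", t, v)
        return (sims * weights.unsqueeze(1)).sum(dim=-1)


if __name__ == "__main__":
    model = RoleAwareMoE()
    scores = model(torch.randn(4, 20, 256), torch.randn(4, 12, 256))
    print(scores.shape)  # torch.Size([4, 4]) text-to-video similarity matrix
```

In this sketch the three role streams would be fed with role-specific tokens (e.g., region, frame, and object features); training such a model with a standard contrastive ranking loss over the similarity matrix would be the natural next step.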


Datasets

YouCook2 · MSR-VTT

Results from the Paper


Task             Dataset    Model   Metric                     Value   Global Rank
Video Retrieval  MSR-VTT    RoME    text-to-video R@1          10.7    #34
Video Retrieval  MSR-VTT    RoME    text-to-video R@5          29.6    #30
Video Retrieval  MSR-VTT    RoME    text-to-video R@10         41.2    #31
Video Retrieval  MSR-VTT    RoME    text-to-video Median Rank  17      #16
Video Retrieval  YouCook2   RoME    text-to-video R@1          6.3     #12
Video Retrieval  YouCook2   RoME    text-to-video R@5          16.9    #11
Video Retrieval  YouCook2   RoME    text-to-video R@10         25.2    #14
Video Retrieval  YouCook2   RoME    text-to-video Median Rank  53      #8
