MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

14 Mar 2022 · Alexander Kunitsyn, Maksim Kalashnikov, Maksim Dzabraev, Andrei Ivaniuta

In this work we present a new state of the art on the text-to-video retrieval task on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF, obtained by a single model. Three different data sources are combined: weakly supervised videos, crowd-labeled text-image pairs and text-video pairs. A careful analysis of available pre-trained networks helps to choose the best ones to use as prior knowledge. We introduce a three-stage training procedure that provides high knowledge-transfer efficiency and allows noisy datasets to be used during training without degrading the prior knowledge. Additionally, double positional encoding is used for better fusion of the different modalities, and a simple method for processing non-square inputs is suggested.
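
The double positional encoding is not detailed in this abstract, so the PyTorch sketch below shows one plausible reading: each token entering the multimodal fusion transformer receives two learned embeddings, one for its temporal position within its modality stream and one identifying the modality (expert) it came from. The class name, dimensions and modality ids here are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class DoublePositionalEncoding(nn.Module):
    """Hypothetical sketch: add two learned embeddings to each token --
    a temporal one (position within the modality stream) and a
    modality one (which feature extractor produced the token)."""

    def __init__(self, dim: int, max_len: int = 128, num_modalities: int = 3):
        super().__init__()
        self.temporal = nn.Embedding(max_len, dim)
        self.modality = nn.Embedding(num_modalities, dim)

    def forward(self, tokens: torch.Tensor, modality_id: int) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) features from one modality expert
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        mod = torch.full((seq_len,), modality_id, dtype=torch.long,
                         device=tokens.device)
        return tokens + self.temporal(pos) + self.modality(mod)

# Example: encode two streams, then concatenate them for the fusion transformer.
dpe = DoublePositionalEncoding(dim=512)
video = dpe(torch.randn(2, 32, 512), modality_id=0)  # e.g. visual features
audio = dpe(torch.randn(2, 16, 512), modality_id=1)  # e.g. audio features
fused_input = torch.cat([video, audio], dim=1)       # (2, 48, 512)
```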

Results from the Paper


 Ranked #1 on Video Retrieval on TGIF (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Video Retrieval | LSMDC | MDMMT-2 | text-to-video R@1 | 26.9 | #13 |
| Video Retrieval | LSMDC | MDMMT-2 | text-to-video R@5 | 46.7 | #9 |
| Video Retrieval | LSMDC | MDMMT-2 | text-to-video R@10 | 55.9 | #9 |
| Video Retrieval | LSMDC | MDMMT-2 | text-to-video Median Rank | 6.7 | #4 |
| Video Retrieval | LSMDC | MDMMT-2 | text-to-video Mean Rank | 48.0 | #5 |
| Video Retrieval | MSR-VTT | MDMMT-2 | text-to-video R@1 | 33.7 | #18 |
| Video Retrieval | MSR-VTT | MDMMT-2 | text-to-video R@5 | 60.5 | #17 |
| Video Retrieval | MSR-VTT | MDMMT-2 | text-to-video R@10 | 70.8 | #16 |
| Video Retrieval | MSR-VTT | MDMMT-2 | text-to-video Median Rank | 3.0 | #1 |
| Video Retrieval | MSR-VTT | MDMMT-2 | text-to-video Mean Rank | 37.8 | #1 |
| Video Retrieval | MSVD | MDMMT-2 | text-to-video R@1 | 56.8 | #7 |
| Video Retrieval | MSVD | MDMMT-2 | text-to-video R@5 | 83.1 | #6 |
| Video Retrieval | MSVD | MDMMT-2 | text-to-video R@10 | 89.2 | #5 |
| Video Retrieval | MSVD | MDMMT-2 | text-to-video Median Rank | 1.0 | #1 |
| Video Retrieval | MSVD | MDMMT-2 | text-to-video Mean Rank | 8.8 | #7 |
| Video Retrieval | TGIF | MDMMT-2 | text-to-video R@1 | 25.5 | #1 |
| Video Retrieval | TGIF | MDMMT-2 | text-to-video R@5 | 46.1 | #1 |
| Video Retrieval | TGIF | MDMMT-2 | text-to-video R@10 | 55.7 | #1 |
| Video Retrieval | TGIF | MDMMT-2 | text-to-video Median Rank | 7.0 | #1 |
| Video Retrieval | TGIF | MDMMT-2 | text-to-video Mean Rank | 94.1 | #1 |
| Video Retrieval | YouCook2 | MDMMT-2 | text-to-video R@1 | 32.0 | #4 |
| Video Retrieval | YouCook2 | MDMMT-2 | text-to-video R@5 | 64.0 | #2 |
| Video Retrieval | YouCook2 | MDMMT-2 | text-to-video R@10 | 74.8 | #3 |
| Video Retrieval | YouCook2 | MDMMT-2 | text-to-video Median Rank | 3.0 | #1 |
| Video Retrieval | YouCook2 | MDMMT-2 | text-to-video Mean Rank | 12.7 | #1 |
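
For reference, the R@K, Median Rank and Mean Rank numbers above are conventionally computed from a text-to-video similarity matrix as in the NumPy sketch below. The function name and the one-query-one-target protocol are assumptions; this is not code from the paper.

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """sim[i, j] is the similarity of text query i to video j;
    video i is assumed to be the ground truth for query i."""
    order = np.argsort(-sim, axis=1)            # videos sorted best-first per query
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argwhere(order == gt)[:, 1] + 1  # 1-based rank of the correct video
    return {
        "R@1":  100.0 * np.mean(ranks <= 1),
        "R@5":  100.0 * np.mean(ranks <= 5),
        "R@10": 100.0 * np.mean(ranks <= 10),
        "Median Rank": float(np.median(ranks)),
        "Mean Rank":   float(np.mean(ranks)),
    }

# Example on random scores: R@K is near 0.1*K and ranks are near 500.
print(retrieval_metrics(np.random.randn(1000, 1000)))
```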

Methods


No methods listed for this paper.