TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Retrieval	DiDeMo	MILES	text-to-video R@1	27.2	# 17
Zero-Shot Video Retrieval	DiDeMo	MILES	text-to-video R@5	50.3	# 19
Zero-Shot Video Retrieval	DiDeMo	MILES	text-to-video R@10	63.6	# 17
Zero-Shot Video Retrieval	DiDeMo	MILES	text-to-video Median Rank	5.0	# 4
Zero-Shot Video Retrieval	LSMDC	MILES	text-to-video R@1	11.1	# 13
Zero-Shot Video Retrieval	LSMDC	MILES	text-to-video R@5	24.7	# 13
Zero-Shot Video Retrieval	LSMDC	MILES	text-to-video R@10	30.6	# 13
Zero-Shot Video Retrieval	LSMDC	MILES	text-to-video Median Rank	50.7	# 4
Zero-Shot Video Retrieval	MSR-VTT	MILES	text-to-video R@1	26.1	# 21
Zero-Shot Video Retrieval	MSR-VTT	MILES	text-to-video R@5	47.2	# 22
Zero-Shot Video Retrieval	MSR-VTT	MILES	text-to-video R@10	56.9	# 22
Zero-Shot Video Retrieval	MSR-VTT	MILES	text-to-video Median Rank	7	# 5
Zero-Shot Video Retrieval	MSVD	MILES	text-to-video R@1	44.4	# 7
Zero-Shot Video Retrieval	MSVD	MILES	text-to-video R@5	76.2	# 7
Zero-Shot Video Retrieval	MSVD	MILES	text-to-video R@10	87.0	# 5
Zero-Shot Video Retrieval	MSVD	MILES	text-to-video Median Rank	2.0	# 3
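
For readers unfamiliar with the metrics in the table above, the following is a minimal sketch of how text-to-video R@K and Median Rank are commonly computed from a text-by-video similarity matrix. It is not taken from the MILES code base. The function names, the assumption that query i matches video i, and the toy similarity values are all hypothetical and chosen only for illustration.

```python
from statistics import median
from typing import Dict, List, Sequence


def text_to_video_ranks(sim: Sequence[Sequence[float]]) -> List[int]:
    """Return the 1-based rank of the correct video for each text query.

    Assumption: sim[i][j] is the similarity between text query i and
    video j, and the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Count the videos that score strictly higher than the correct one.
        better = sum(1 for score in row if score > row[i])
        ranks.append(better + 1)
    return ranks


def retrieval_metrics(sim: Sequence[Sequence[float]],
                      ks: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """Compute R@K (as a percentage, higher is better) and Median Rank
    (lower is better)."""
    ranks = text_to_video_ranks(sim)
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = float(median(ranks))
    return metrics


# Toy example with three queries and three videos (illustrative values only).
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
toy_metrics = retrieval_metrics(toy_sim)
```

In this toy case the correct video is ranked first for two of the three queries and third for the remaining one. R@K rewards a high share of queries whose correct video appears in the top K, while Median Rank summarizes the typical position of the correct video.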

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/miles-visual-bert-pre-training-with-injected/zero-shot-video-retrieval-on-msvd)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msvd?p=miles-visual-bert-pre-training-with-injected)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/miles-visual-bert-pre-training-with-injected/zero-shot-video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-lsmdc?p=miles-visual-bert-pre-training-with-injected)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/miles-visual-bert-pre-training-with-injected/zero-shot-video-retrieval-on-didemo)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-didemo?p=miles-visual-bert-pre-training-with-injected)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/miles-visual-bert-pre-training-with-injected/zero-shot-video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msr-vtt?p=miles-visual-bert-pre-training-with-injected)`

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

26 Apr 2022 · Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, XiaoHu Qie, Ping Luo

Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations but ignore detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to this limitation. In this work, we investigate, for the first time, masked visual modeling in video-text pre-training with the "dual-encoder" architecture. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving "tokenizer" to produce reconstruction targets for masked video patch prediction. Given the corrupted video, the video encoder is trained to recover text-aligned features of the masked patches by reasoning over the visible regions along the spatial and temporal dimensions, which enhances the discriminativeness of local visual features and the fine-grained cross-modality alignment. Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tuning evaluation protocols. Our approach also significantly surpasses the baseline models on zero-shot action recognition, which can be cast as video-to-text retrieval.
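
The masked-patch objective described in the abstract can be illustrated with a small sketch. The abstract does not say how the snapshot encoder evolves or which distance is used for reconstruction, so the exponential-moving-average update, the mean-squared-error loss, and every name and value below are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def ema_update(snapshot: Sequence[float],
               online: Sequence[float],
               momentum: float = 0.99) -> List[float]:
    """Move the snapshot ("tokenizer") weights toward the online video
    encoder's weights. ASSUMPTION: an exponential moving average; the
    abstract only says the snapshot encoder is "evolving"."""
    return [momentum * s + (1.0 - momentum) * o
            for s, o in zip(snapshot, online)]


def masked_regression_loss(pred: Sequence[Sequence[float]],
                           target: Sequence[Sequence[float]],
                           mask: Sequence[bool]) -> float:
    """Average squared error between predicted and target patch features,
    computed over masked patches only.

    pred:   per-patch features from the video encoder on the corrupted video
    target: per-patch features from the snapshot encoder (reconstruction targets)
    mask:   True where the patch was masked out of the encoder input
    ASSUMPTION: mean squared error stands in for the unspecified objective.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)
            count += 1
    return total / count if count else 0.0


# Toy values: two patches with 2-d features; only the first patch is masked.
toy_pred = [[1.0, 1.0], [0.0, 0.0]]
toy_target = [[0.0, 0.0], [5.0, 5.0]]
toy_mask = [True, False]
toy_loss = masked_regression_loss(toy_pred, toy_target, toy_mask)
toy_snapshot = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.5)
```

Only the masked patch contributes to the loss, so the large error on the visible second patch is ignored. This mirrors the idea of recovering features for masked regions from the visible context.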


Code


tencentarc/mcq official

129

Tasks


Action Recognition

Retrieval

Text Retrieval

Text to Video Retrieval

Video Retrieval

Video-Text Retrieval

Video to Text Retrieval

Zero-Shot Action Recognition

Zero-Shot Video Retrieval

Datasets

UCF101

HMDB51

MSR-VTT

MSVD

HowTo100M

DiDeMo

WebVid

LSMDC

Results from the Paper


Ranked #7 on Zero-Shot Video Retrieval on MSVD



Methods


Adam • Attention Dropout • BERT • Dense Connections • Dropout • GELU • Layer Normalization • Linear Layer • Linear Warmup With Linear Decay • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Weight Decay • WordPiece
