Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

12 Oct 2022 · Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu

Large-scale video-language pre-training has led to significant improvements on video-language understanding tasks. Previous studies of video-language pre-training mainly focus on short-form videos (i.e., within 30 seconds) and sentences, leaving long-form video-language pre-training rarely explored. Directly learning representations from long-form videos and language may benefit many long-form video-language understanding tasks. However, it is challenging due to the difficulty of modeling long-range relationships and the heavy computational burden caused by more frames. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from an existing public dataset. To effectively capture the rich temporal dynamics and to better align video and language in an efficient end-to-end manner, we introduce two novel designs in our LF-VILA model. We first propose a Multimodal Temporal Contrastive (MTC) loss to learn temporal relations across different modalities by encouraging fine-grained alignment between long-form videos and paragraphs. Second, we propose a Hierarchical Temporal Window Attention (HTWA) mechanism to effectively capture long-range dependencies while reducing the computational cost of the Transformer. We fine-tune the pre-trained LF-VILA model on seven downstream long-form video-language understanding tasks covering paragraph-to-video retrieval and long-form video question answering, and achieve new state-of-the-art performance. Specifically, our model achieves relative improvements of 16.1% on the ActivityNet paragraph-to-video retrieval task and 2.4% on the How2QA task. We release our code, dataset, and pre-trained models at https://github.com/microsoft/XPretrain.
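
The two components can be pictured with short, self-contained sketches. First, a minimal PyTorch sketch of a multimodal temporal contrastive objective in the spirit of the MTC loss: it assumes per-clip and per-sentence embeddings (`clip_emb`, `sent_emb`) where the i-th clip and i-th sentence are temporally aligned, and uses a symmetric InfoNCE-style objective. The function name, shapes, and temperature value are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F


def mtc_loss_sketch(clip_emb: torch.Tensor, sent_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style temporal contrastive loss between clip and sentence embeddings.

    clip_emb: (N, D) embeddings of N consecutive clips from one long-form video.
    sent_emb: (N, D) embeddings of the N temporally aligned sentences of its paragraph.
    The i-th clip and i-th sentence are treated as a positive pair; all other
    clip-sentence combinations act as negatives.
    """
    clip_emb = F.normalize(clip_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)

    logits = clip_emb @ sent_emb.t() / temperature          # (N, N) cosine similarities
    targets = torch.arange(clip_emb.size(0), device=clip_emb.device)

    # Symmetric cross-entropy over the clip-to-sentence and sentence-to-clip directions.
    loss_c2s = F.cross_entropy(logits, targets)
    loss_s2c = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_c2s + loss_s2c)
```

Second, the HTWA idea of restricting temporal attention to windows, with window size growing across stages, can be sketched as below. The class is a simplified stand-in: it assumes frame tokens of shape (batch, num_frames, dim) and uses `nn.MultiheadAttention` within non-overlapping windows; the actual LF-VILA architecture may differ in window scheduling and layer composition.

```python
import torch
import torch.nn as nn


class TemporalWindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping temporal windows of frame tokens.

    Stacking such layers with progressively larger window sizes builds a
    hierarchical receptive field over time while avoiding full attention
    across all frames of a long-form video.
    """

    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim); num_frames is assumed divisible by window_size.
        b, t, d = x.shape
        w = self.window_size
        windows = x.reshape(b * t // w, w, d)           # split the time axis into windows
        out, _ = self.attn(windows, windows, windows)   # attend only within each window
        return out.reshape(b, t, d)
```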

Results from the Paper


Ranked #2 on Video Retrieval on QuerYD (using extra training data)

Task             Dataset           Model    Metric              Value  Global Rank
Video Retrieval  Condensed Movies  LF-VILA  text-to-video R@1   13.6   #3
Video Retrieval  Condensed Movies  LF-VILA  text-to-video R@5   32.5   #3
Video Retrieval  Condensed Movies  LF-VILA  text-to-video R@10  41.8   #3
Video Retrieval  QuerYD            LF-VILA  text-to-video R@1   69.7   #2
Video Retrieval  QuerYD            LF-VILA  text-to-video R@5   85.7   #3
Video Retrieval  QuerYD            LF-VILA  text-to-video R@10  90.3   #2
