VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g., by forcing masked text tokens to predict the closest video embeddings) while also maintaining separability (e.g., unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training. Code is available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
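To make the masking idea above concrete, the sketch below shows one way a shared video-text encoder could be trained with joint masking: masked text positions are supervised by the vocabulary, while masked video positions are trained to match their own unmasked frame embeddings among all masked frames in the batch. This is a minimal illustration under assumed shapes and hyperparameters; names such as `SharedVideoTextEncoder` and `joint_masking_loss` are hypothetical and do not reflect the fairseq MMPT API linked above.

```python
# Hypothetical sketch of joint text/video masking with a shared encoder,
# assuming precomputed frame features (e.g. S3D-style) and BERT-style token ids.
# Not the authors' implementation; see the MMPT repository for the real code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedVideoTextEncoder(nn.Module):
    """One transformer encoder shared by both modalities (illustrative config)."""

    def __init__(self, vocab_size=30522, video_dim=512, hidden=768, layers=6, heads=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.vid_proj = nn.Linear(video_dim, hidden)   # project frame features into the text space
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)  # predict masked text tokens
        self.mfm_head = nn.Linear(hidden, hidden)      # predict masked frame embeddings

    def forward(self, token_ids, frame_feats):
        x = torch.cat([self.tok_emb(token_ids), self.vid_proj(frame_feats)], dim=1)
        return self.encoder(x)


def joint_masking_loss(model, token_ids, frame_feats, mask_prob=0.15):
    """Mask text and video positions jointly so that predictions for one modality
    can draw on context from the other, encouraging early cross-modal fusion."""
    B, T = token_ids.shape
    V = frame_feats.shape[1]

    text_mask = torch.rand(B, T) < mask_prob
    vid_mask = torch.rand(B, V) < mask_prob
    # Keep the toy example well defined: always mask at least one position.
    text_mask[:, 0] = True
    vid_mask[:, 0] = True

    masked_tokens = token_ids.masked_fill(text_mask, 103)               # 103 = [MASK] id (assumption)
    masked_frames = frame_feats.masked_fill(vid_mask.unsqueeze(-1), 0.0)

    hidden = model(masked_tokens, masked_frames)
    text_h, vid_h = hidden[:, :T], hidden[:, T:]

    # Masked language modelling on text positions.
    mlm_loss = F.cross_entropy(model.mlm_head(text_h[text_mask]), token_ids[text_mask])

    # Masked frame modelling: the predicted embedding should be closest to the
    # original (unmasked) embedding of the same frame among all masked frames.
    pred = model.mfm_head(vid_h[vid_mask])                              # (M, H)
    target = model.vid_proj(frame_feats)[vid_mask]                      # (M, H)
    logits = pred @ target.t()
    mfm_loss = F.cross_entropy(logits, torch.arange(pred.shape[0]))

    return mlm_loss + mfm_loss


if __name__ == "__main__":
    model = SharedVideoTextEncoder()
    tokens = torch.randint(0, 30522, (2, 16))    # fake token ids
    frames = torch.randn(2, 8, 512)              # fake frame features
    print(joint_masking_loss(model, tokens, frames))
```

Dropping one modality from the input (text-only or video-only masking) recovers the unimodal predictions mentioned in the abstract, which is what keeps the two modalities separable for retrieval-style end tasks.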

Findings (ACL) 2021

Results from the Paper


Ranked #2 on Temporal Action Localization on CrossTask (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data |
|---|---|---|---|---|---|---|
| Action Segmentation | COIN | VLM | Frame accuracy | 68.4 | #5 | |
| Temporal Action Localization | CrossTask | VLM | Recall | 46.5 | #2 | Yes |
| Video Retrieval | MSR-VTT-1kA | VLM | text-to-video R@1 | 28.10 | #49 | |
| Video Retrieval | MSR-VTT-1kA | VLM | text-to-video R@5 | 55.50 | #47 | |
| Video Retrieval | MSR-VTT-1kA | VLM | text-to-video R@10 | 67.40 | #50 | |
| Video Retrieval | MSR-VTT-1kA | VLM | text-to-video Median Rank | 4 | #28 | |
| Video Retrieval | YouCook2 | VLM | text-to-video Median Rank | 4 | #3 | |
| Video Retrieval | YouCook2 | VLM | text-to-video R@1 | 27.05 | #7 | |
| Video Retrieval | YouCook2 | VLM | text-to-video R@5 | 56.88 | #7 | |
| Video Retrieval | YouCook2 | VLM | text-to-video R@10 | 69.38 | #8 | |
| Video Captioning | YouCook2 | VLM | BLEU-3 | 17.78 | #4 | |
| Video Captioning | YouCook2 | VLM | BLEU-4 | 12.27 | #5 | |
| Video Captioning | YouCook2 | VLM | METEOR | 18.22 | #5 | |
| Video Captioning | YouCook2 | VLM | ROUGE-L | 41.51 | #3 | |
| Video Captioning | YouCook2 | VLM | CIDEr | 1.3869 | #4 | |
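For reference, the retrieval numbers in the table (R@1/R@5/R@10 and Median Rank) follow the standard text-to-video protocol: rank all candidate videos for each text query and check where the ground-truth video lands. The sketch below shows the conventional computation from a text-video similarity matrix; it is an illustration, not code from the VLM repository.

```python
# Standard text-to-video retrieval metrics from a similarity matrix
# (ground truth on the diagonal); illustrative only.
import numpy as np


def retrieval_metrics(sim):
    """sim[i, j] = similarity of text query i to video j."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                      # videos sorted by score per query
    ranks = np.array([np.where(order[i] == i)[0][0] + 1   # rank of the correct video (1 = best)
                      for i in range(n)])
    return {
        "R@1": np.mean(ranks <= 1) * 100,
        "R@5": np.mean(ranks <= 5) * 100,
        "R@10": np.mean(ranks <= 10) * 100,
        "MedR": float(np.median(ranks)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(1000, 1000))                 # e.g. a 1k-A style test split
    scores[np.arange(1000), np.arange(1000)] += 3.0        # make ground truth easier to retrieve
    print(retrieval_metrics(scores))
```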
