VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g., by forcing masked text tokens to predict the closest video embeddings) while also maintaining separability (e.g., unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training. Code is available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
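To make the masking idea above concrete, the sketch below shows one way a shared video-text encoder could be trained with joint masking: masked text positions are supervised by the vocabulary, while masked video positions are trained to match their own unmasked frame embeddings among all masked frames in the batch. This is a minimal illustration under assumed shapes and hyperparameters; names such as `SharedVideoTextEncoder` and `joint_masking_loss` are hypothetical and do not reflect the fairseq MMPT API linked above.

```python
# Hypothetical sketch of joint text/video masking with a shared encoder,
# assuming precomputed frame features (e.g. S3D-style) and BERT-style token ids.
# Not the authors' implementation; see the MMPT repository for the real code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedVideoTextEncoder(nn.Module):
    """One transformer encoder shared by both modalities (illustrative config)."""

    def __init__(self, vocab_size=30522, video_dim=512, hidden=768, layers=6, heads=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.vid_proj = nn.Linear(video_dim, hidden)   # project frame features into the text space
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)  # predict masked text tokens
        self.mfm_head = nn.Linear(hidden, hidden)      # predict masked frame embeddings

    def forward(self, token_ids, frame_feats):
        x = torch.cat([self.tok_emb(token_ids), self.vid_proj(frame_feats)], dim=1)
        return self.encoder(x)


def joint_masking_loss(model, token_ids, frame_feats, mask_prob=0.15):
    """Mask text and video positions jointly so that predictions for one modality
    can draw on context from the other, encouraging early cross-modal fusion."""
    B, T = token_ids.shape
    V = frame_feats.shape[1]

    text_mask = torch.rand(B, T) < mask_prob
    vid_mask = torch.rand(B, V) < mask_prob
    # Keep the toy example well defined: always mask at least one position.
    text_mask[:, 0] = True
    vid_mask[:, 0] = True

    masked_tokens = token_ids.masked_fill(text_mask, 103)               # 103 = [MASK] id (assumption)
    masked_frames = frame_feats.masked_fill(vid_mask.unsqueeze(-1), 0.0)

    hidden = model(masked_tokens, masked_frames)
    text_h, vid_h = hidden[:, :T], hidden[:, T:]

    # Masked language modelling on text positions.
    mlm_loss = F.cross_entropy(model.mlm_head(text_h[text_mask]), token_ids[text_mask])

    # Masked frame modelling: the predicted embedding should be closest to the
    # original (unmasked) embedding of the same frame among all masked frames.
    pred = model.mfm_head(vid_h[vid_mask])                              # (M, H)
    target = model.vid_proj(frame_feats)[vid_mask]                      # (M, H)
    logits = pred @ target.t()
    mfm_loss = F.cross_entropy(logits, torch.arange(pred.shape[0]))

    return mlm_loss + mfm_loss


if __name__ == "__main__":
    model = SharedVideoTextEncoder()
    tokens = torch.randint(0, 30522, (2, 16))    # fake token ids
    frames = torch.randn(2, 8, 512)              # fake frame features
    print(joint_masking_loss(model, tokens, frames))
```

Dropping one modality from the input (text-only or video-only masking) recovers the unimodal predictions mentioned in the abstract, which is what keeps the two modalities separable for retrieval-style end tasks.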

Findings (ACL) 2021

Results from the Paper


Ranked #2 on Temporal Action Localization on CrossTask (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data |
|---|---|---|---|---|---|---|
| Action Segmentation | COIN | VLM | Frame accuracy | 68.4 | #5 | |
| Temporal Action Localization | CrossTask | VLM | Recall | 46.5 | #2 | Yes |
| Video Retrieval | MSR-VTT-1kA | VLM | text-to-video R@1 | 28.10 | #49 | |
| Video Retrieval | MSR-VTT-1kA | VLM | text-to-video R@5 | 55.50 | #47 | |
| Video Retrieval | MSR-VTT-1kA | VLM | text-to-video R@10 | 67.40 | #50 | |
| Video Retrieval | MSR-VTT-1kA | VLM | text-to-video Median Rank | 4 | #28 | |
| Video Retrieval | YouCook2 | VLM | text-to-video Median Rank | 4 | #3 | |
| Video Retrieval | YouCook2 | VLM | text-to-video R@1 | 27.05 | #7 | |
| Video Retrieval | YouCook2 | VLM | text-to-video R@5 | 56.88 | #7 | |
| Video Retrieval | YouCook2 | VLM | text-to-video R@10 | 69.38 | #8 | |
| Video Captioning | YouCook2 | VLM | BLEU-3 | 17.78 | #4 | |
| Video Captioning | YouCook2 | VLM | BLEU-4 | 12.27 | #5 | |
| Video Captioning | YouCook2 | VLM | METEOR | 18.22 | #5 | |
| Video Captioning | YouCook2 | VLM | ROUGE-L | 41.51 | #3 | |
| Video Captioning | YouCook2 | VLM | CIDEr | 1.3869 | #4 | |
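For reference, the retrieval numbers in the table (R@1/R@5/R@10 and Median Rank) follow the standard text-to-video protocol: rank all candidate videos for each text query and check where the ground-truth video lands. The sketch below shows the conventional computation from a text-video similarity matrix; it is an illustration, not code from the VLM repository.

```python
# Standard text-to-video retrieval metrics from a similarity matrix
# (ground truth on the diagonal); illustrative only.
import numpy as np


def retrieval_metrics(sim):
    """sim[i, j] = similarity of text query i to video j."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                      # videos sorted by score per query
    ranks = np.array([np.where(order[i] == i)[0][0] + 1   # rank of the correct video (1 = best)
                      for i in range(n)])
    return {
        "R@1": np.mean(ranks <= 1) * 100,
        "R@5": np.mean(ranks <= 5) * 100,
        "R@10": np.mean(ranks <= 10) * 100,
        "MedR": float(np.median(ranks)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(1000, 1000))                 # e.g. a 1k-A style test split
    scores[np.arange(1000), np.arange(1000)] += 3.0        # make ground truth easier to retrieve
    print(retrieval_metrics(scores))
```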
