UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

15 Feb 2020 · Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou

Following the recent success of pre-training for NLP and image-linguistic tasks, several video-linguistic pre-training approaches have been developed to improve video-text downstream tasks. However, most existing multimodal models are pre-trained only for understanding tasks, which leads to a pretrain-finetune discrepancy on generation tasks. This paper proposes UniVL, a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components built on the Transformer backbone: two single-modal encoders, a cross encoder, and a decoder. Five objectives are designed to train these components: video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction. We further develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective. Pre-training is carried out on HowTo100M, a large instructional video dataset. Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.


Results from the Paper

 Ranked #1 on Video Captioning on YouCook2 (using extra training data)

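Task              Dataset    Model   Metric Name                  Metric Value   Global Rank
Video Retrieval   MSR-VTT    UniVL   text-to-video R@1            21.2           #12
Video Retrieval   MSR-VTT    UniVL   text-to-video R@5            49.6           #11
Video Retrieval   MSR-VTT    UniVL   text-to-video R@10           63.1           #8
Video Retrieval   MSR-VTT    UniVL   text-to-video Median Rank    6              #8
Video Captioning  YouCook2   UniVL   BLEU-3                       23.87          #1
Video Captioning  YouCook2   UniVL   BLEU-4                       17.35          #1
Video Captioning  YouCook2   UniVL   METEOR                       22.35          #1
Video Captioning  YouCook2   UniVL   ROUGE-L                      46.52          #1
Video Captioning  YouCook2   UniVL   CIDEr                        1.81           #1
Video Retrieval   YouCook2   UniVL   text-to-video R@1            28.9           #4
Video Retrieval   YouCook2   UniVL   text-to-video R@5            57.6           #4
Video Retrieval   YouCook2   UniVL   text-to-video R@10           70.0           #4
Video Retrieval   YouCook2   UniVL   text-to-video Median Rank    4              #2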
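
The text-to-video retrieval metrics reported above (R@1, R@5, R@10, Median Rank) are usually computed from a query-by-video similarity matrix. The following is a minimal sketch of that standard computation, not the UniVL evaluation code. It assumes that sim[i][j] is the score of text query i against video j and that video i is the single correct match for query i. The function names and the toy matrix are hypothetical and chosen only for illustration.

```python
from statistics import median


def retrieval_ranks(sim):
    """Return the 1-based rank of the correct video for each text query.

    Assumption: sim[i][j] scores query i against video j, and the
    ground-truth video for query i sits at index i (the diagonal).
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median rank of the correct video over all queries (lower is better)."""
    return median(ranks)


# Toy similarity matrix with made-up scores: 4 text queries x 4 videos.
sim = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.5, 0.1, 0.2],
    [0.2, 0.7, 0.6, 0.9],
    [0.1, 0.2, 0.3, 0.4],
]

ranks = retrieval_ranks(sim)
print("ranks:", ranks)
print("R@1:", recall_at_k(ranks, 1))
print("R@5:", recall_at_k(ranks, 5))
print("Median Rank:", median_rank(ranks))
```

Under these definitions, higher R@K and lower Median Rank both indicate better retrieval, which is how the rows in the table should be read.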