An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

Masked visual modeling (MVM) has recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies have failed to find a truly effective MVM strategy that substantially benefits downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training, resulting in an enhanced model, VIOLETv2. Empirically, we show that VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, spanning video question answering, video captioning, and text-to-video retrieval.
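To make the idea of a reconstructive MVM objective concrete, the following is a minimal sketch of a masked reconstruction loss on raw pixel-value targets. It is an illustration only, not the VIOLETv2 implementation: the function names `sample_patch_mask` and `masked_l1_loss`, the uniform random masking scheme, and the choice of an L1 regression loss are all assumptions that the abstract does not specify.

```python
import random


def sample_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans; True marks a patch that is masked out.

    Assumption: patches are masked uniformly at random. The abstract does
    not say which masking strategy the paper uses.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_l1_loss(predictions, targets, mask):
    """Mean absolute error computed over masked patches only.

    predictions, targets: one list of floats per patch (e.g. flattened
    pixel values). Unmasked patches are visible to the model, so they
    contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for pred, target, is_masked in zip(predictions, targets, mask):
        if not is_masked:
            continue
        for p, t in zip(pred, target):
            total += abs(p - t)
            count += 1
    return total / count if count else 0.0
```

Swapping the reconstructive target (for example, depth maps or optical flow instead of pixel values) would change what `targets` holds but leave the masking-and-regress structure of this sketch unchanged; discrete visual tokens would instead call for a classification loss.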

CVPR 2023

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Retrieval | DiDeMo | VIOLETv2 | text-to-video R@1 | 47.9 | #27 |
| Video Retrieval | DiDeMo | VIOLETv2 | text-to-video R@5 | 76.5 | #23 |
| Video Retrieval | DiDeMo | VIOLETv2 | text-to-video R@10 | 84.1 | #24 |
| Fill Mask | LSMDC | VIOLETv2 | Accuracy | 56.9 | #2 |
| Video Retrieval | LSMDC | VIOLETv2 | text-to-video R@1 | 24 | #21 |
| Video Retrieval | LSMDC | VIOLETv2 | text-to-video R@5 | 43.5 | #16 |
| Video Retrieval | LSMDC | VIOLETv2 | text-to-video R@10 | 54.1 | #14 |
| Video Question Answering | LSMDC-MC | VIOLETv2 | Accuracy | 84.4 | #1 |
| Video Captioning | MSR-VTT | VIOLETv2 | CIDEr | 58 | #18 |
| Video Retrieval | MSR-VTT | VIOLETv2 | text-to-video R@1 | 37.2 | #15 |
| Video Retrieval | MSR-VTT | VIOLETv2 | text-to-video R@5 | 64.8 | #13 |
| Video Retrieval | MSR-VTT | VIOLETv2 | text-to-video R@10 | 75.8 | #14 |
| Video Question Answering | MSRVTT-MC | VIOLETv2 | Accuracy | 97.6 | #1 |
| Video Question Answering | MSRVTT-QA | VIOLETv2 | Accuracy | 44.5 | #11 |
| Video Captioning | MSVD | VIOLETv2 | CIDEr | 139.2 | #9 |
| Visual Question Answering (VQA) | MSVD-QA | VIOLETv2 | Accuracy | 0.547 | #15 |
| TGIF-Frame | TGIF-QA | VIOLETv2 | Accuracy | 72.8 | #8 |
| TGIF-Action | TGIF-QA | VIOLETv2 | Accuracy | 94.8 | #5 |
| TGIF-Transition | TGIF-QA | VIOLETv2 | Accuracy | 99 | #2 |
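
For readers unfamiliar with the text-to-video R@K figures reported above, the sketch below shows how Recall@K is commonly computed from a text-to-video similarity matrix. It is a generic illustration rather than the evaluation code behind these results: the function name `recall_at_k` and the assumption that text query i is paired with exactly one ground-truth video at index i are both hypothetical.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose ground-truth video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        correct_score = scores[query_idx]
        # Zero-based rank: the number of videos scored strictly higher
        # than the ground-truth video for this query.
        rank = sum(1 for s in scores if s > correct_score)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example with three text queries and three candidate videos.
toy_similarity = [
    [0.9, 0.1, 0.2],  # query 0: correct video is ranked first
    [0.8, 0.3, 0.5],  # query 1: correct video is ranked third
    [0.1, 0.7, 0.6],  # query 2: correct video is ranked second
]
r_at_1 = recall_at_k(toy_similarity, 1)
r_at_3 = recall_at_k(toy_similarity, 3)
```

Under this definition, R@1 ≤ R@5 ≤ R@10 always holds for a given model and dataset, which matches the pattern in the retrieval rows above.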

Methods