SimVTP: Simple Video Text Pre-training with Masked Autoencoders

7 Dec 2022  ·  Yue Ma, Tianyu Yang, Yin Shan, Xiu Li ·

This paper presents SimVTP: a Simple Video-Text Pretraining framework via masked autoencoders. We randomly mask out the spatial-temporal tubes of input video and the word tokens of input text and then feed them into a unified autencoder to reconstruct the missing pixels and words. Our SimVTP has several properties: 1) Thanks to the unified autoencoder, SimVTP reconstructs the masked signal of one modality with the help from another modality, which implicitly learns the cross-modal alignment between video tubes and text tokens. 2) SimVTP not only benefits from a high video masking ratio (e.g. 90%) due to the temporal redundancy of video, but also needs a high text masking ratio (e.g. 75%), which is much higher than BERT (e.g. 15%), to achieve optimal performance. This is because the aid of video modality makes text reconstruction less challenging, which thus needs a higher mask ratio to make the pretext harder for useful feature learning. 3) Equipping SimVTP with video-text contrastive learning (VTC) and video-text matching (VTM), which are two commonly used cross-modal training strategies, could further improve the transferable performance significantly. 4) SimVTP is dataefficent, e.g., pre-training only on 10% data of WebVid-2M, SimVTP achieves surprisingly good results (43.8 R@1) on MSRVTT, which is far above recent state-of-the-art methods pre-trained on both CC3M and WebVid-2M. We transfer our pre-trained model to various downstream tasks and achieve superior performance. The codes and models will be released at

PDF Abstract

Results from the Paper

Task Dataset Model Metric Name Metric Value Global Rank Result Benchmark
Moment Retrieval Charades-STA SimVTP R@1 IoU=0.5 44.7 # 5
R@1 IoU=0.7 26.3 # 4
R@5 IoU=0.5 83.7 # 3
R@5 IoU=0.7 55.1 # 2