VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Scale is the primary factor in building a powerful foundation model that generalizes well to a variety of downstream tasks. However, training video foundation models with billions of parameters remains challenging. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is already very efficient due to the high masking ratio in the encoder, masking the decoder further reduces the overall computational cost. This enables the efficient pre-training of billion-level models on video. We also use a progressive training paradigm that involves an initial pre-training on a diverse multi-sourced unlabeled dataset, followed by a post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on the Kinetics datasets (90.0% on K400 and 89.9% on K600) and the Something-Something datasets (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners. The code and models are available at \url{}.
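The dual masking idea in the abstract can be illustrated with a small index-selection sketch. This is not the authors' implementation. The token grid size (8 x 14 x 14), the masking ratios (0.9 for the encoder, 0.5 for the decoder), the tube-style encoder mask, and the function name `dual_mask` are all assumptions made for illustration. The abstract does not describe the decoder's masking pattern, so uniform random sampling is used here as a stand-in.

```python
import random


def dual_mask(t, h, w, enc_mask_ratio=0.9, dec_mask_ratio=0.5, seed=0):
    """Return (encoder_visible, decoder_targets) as flat token indices.

    Hypothetical helper: ratios and masking patterns are illustrative
    assumptions, not values taken from the abstract.
    """
    rng = random.Random(seed)
    n_spatial = h * w

    # Encoder mask (tube-style assumption): pick the visible spatial
    # positions once and reuse them for every temporal slice.
    n_visible_spatial = n_spatial - int(n_spatial * enc_mask_ratio)
    visible_spatial = set(rng.sample(range(n_spatial), n_visible_spatial))
    encoder_visible = [
        f * n_spatial + s for f in range(t) for s in sorted(visible_spatial)
    ]

    # Tokens hidden from the encoder are the candidates for reconstruction.
    encoder_masked = [
        i for i in range(t * n_spatial) if i % n_spatial not in visible_spatial
    ]

    # Decoder mask (uniform-sampling stand-in): reconstruct only a subset
    # of the encoder-masked tokens instead of all of them.
    n_targets = len(encoder_masked) - int(len(encoder_masked) * dec_mask_ratio)
    decoder_targets = sorted(rng.sample(encoder_masked, n_targets))
    return encoder_visible, decoder_targets


encoder_visible, decoder_targets = dual_mask(t=8, h=14, w=14)
total_tokens = 8 * 14 * 14
decoder_input = len(encoder_visible) + len(decoder_targets)
print(f"encoder sees {len(encoder_visible)} of {total_tokens} tokens")
print(f"decoder processes {decoder_input} tokens instead of {total_tokens}")
```

Under these assumed settings the encoder sees 160 of 1568 tokens, and the decoder handles 864 tokens (160 encoded plus 704 reconstruction targets) rather than the full 1568. This is where the additional saving described in the abstract would come from.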

CVPR 2023

Results from the Paper

 Ranked #1 on Temporal Action Localization on FineAction (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Spatio-Temporal Action Localization | AVA-Kinetics | VideoMAE V2-g | val mAP | 43.89 | # 2 |
| Action Recognition | AVA v2.2 | VideoMAE V2-g | mAP | 42.6 | # 3 |
| Temporal Action Localization | FineAction | VideoMAE V2-g | mAP | 18.24 | # 1 |
| | | | mAP IOU@0.5 | 29.07 | # 1 |
| | | | mAP IOU@0.75 | 17.66 | # 1 |
| | | | mAP IOU@0.95 | 5.07 | # 1 |
| Action Recognition | HMDB-51 | VideoMAE V2-g | Average accuracy of 3 splits | 88.1 | # 1 |
| Action Classification | Kinetics-400 | VideoMAE V2-g | Acc@1 | 88.5 | # 14 |
| | | | Acc@5 | 98.1 | # 9 |
| Action Classification | Kinetics-400 | VideoMAE V2-g (64x266x266) | Acc@1 | 90.0 | # 5 |
| | | | Acc@5 | 98.4 | # 4 |
| Action Classification | Kinetics-600 | VideoMAE V2-g | Top-1 Accuracy | 88.8 | # 15 |
| | | | Top-5 Accuracy | 98.2 | # 8 |
| Action Classification | Kinetics-600 | VideoMAE V2-g (64x266x266) | Top-1 Accuracy | 89.9 | # 9 |
| | | | Top-5 Accuracy | 98.5 | # 4 |
| Action Recognition | Something-Something V1 | VideoMAE V2-g | Top 1 Accuracy | 68.7 | # 2 |
| | | | Top 5 Accuracy | 91.9 | # 1 |
| Action Recognition | Something-Something V2 | VideoMAE V2-g | Top-1 Accuracy | 77.0 | # 3 |
| | | | Top-5 Accuracy | 95.9 | # 1 |
| | | | Parameters | 1013 | # 12 |
| | | | GFLOPs | 2544x6 | # 6 |
| Temporal Action Localization | THUMOS’14 | ActionFormer (VideoMAE V2-g features) | mAP IOU@0.5 | 73.0 | # 2 |
| | | | mAP IOU@0.3 | 84.0 | # 2 |
| | | | mAP IOU@0.4 | 79.6 | # 3 |
| | | | mAP IOU@0.6 | 63.5 | # 2 |
| | | | mAP IOU@0.7 | 47.7 | # 2 |
| | | | Avg mAP (0.3:0.7) | 69.6 | # 3 |
| Action Recognition | UCF101 | VideoMAE V2-g | 3-fold Accuracy | 99.6 | # 1 |
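The temporal action localization rows above report mAP at several IoU thresholds plus an average. As a rough sketch, the snippet below shows how temporal IoU between two segments is commonly computed. It also checks that the listed THUMOS’14 "Avg mAP (0.3:0.7)" matches the arithmetic mean of the five per-threshold values. That averaging rule is an assumption here, although it does reproduce the listed 69.6. The function names `temporal_iou` and `average_map` are hypothetical, and the example segments are invented for illustration.

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two temporal segments given as (start, end) in seconds."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


def average_map(map_by_threshold):
    """Arithmetic mean of mAP values over the given IoU thresholds."""
    return sum(map_by_threshold.values()) / len(map_by_threshold)


# Per-threshold THUMOS'14 values copied from the results listing.
thumos14_map = {0.3: 84.0, 0.4: 79.6, 0.5: 73.0, 0.6: 63.5, 0.7: 47.7}
avg = average_map(thumos14_map)
print(round(avg, 1))  # 69.6, matching the listed Avg mAP (0.3:0.7)

# Made-up segments: a prediction counts as a match at threshold 0.3
# only if its IoU with the ground truth is at least 0.3.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

The two example segments overlap for 2 seconds out of a 6-second union, so their IoU is 1/3. Under this metric they would count as a match at the 0.3 threshold but not at 0.4 or above.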