VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

23 Mar 2022  ·  Zhan Tong, Yibing Song, Jue Wang, Limin Wang

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging the extraction of more effective video representations during pre-training. We obtain three important findings on SSVP: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
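The tube-masking idea described above can be sketched as follows: sample one random spatial mask at the target ratio and repeat it across all frames, so that a masked patch stays hidden in every frame and cannot be trivially recovered from temporally adjacent content. This is a minimal NumPy illustration under our own naming and shape conventions, not the authors' implementation:

```python
import numpy as np

def tube_mask(num_frames, num_patches, mask_ratio=0.9, rng=None):
    """Sample one spatial mask and repeat it over time ("tube" masking).

    Returns a boolean array of shape (num_frames, num_patches) where True
    marks a masked patch. The same patches are hidden in every frame, which
    removes the temporal shortcut of copying from neighboring frames.
    """
    rng = rng or np.random.default_rng(0)
    num_masked = int(round(num_patches * mask_ratio))
    spatial = np.zeros(num_patches, dtype=bool)
    idx = rng.choice(num_patches, size=num_masked, replace=False)
    spatial[idx] = True
    # Tile the single spatial mask along the time axis.
    return np.tile(spatial, (num_frames, 1))

# Example: 8 frames of a 14x14 patch grid (196 patches), 90% masked.
mask = tube_mask(num_frames=8, num_patches=196, mask_ratio=0.9)
print(mask.shape)  # (8, 196); each row is the identical spatial mask
```

With a 90% ratio, only about 20 of 196 patches per frame remain visible, which is what makes the reconstruction task hard enough to drive useful representations.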

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Action Recognition | AVA v2.2 | VideoMAE (K400 pretrain, ViT-L, 16x4) | mAP | 34.3 | #19 |
| Action Recognition | AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4) | mAP | 31.8 | #22 |
| Action Recognition | AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) | mAP | 37.8 | #13 |
| Action Recognition | AVA v2.2 | VideoMAE (K400 pretrain, ViT-B, 16x4) | mAP | 26.7 | #32 |
| Action Recognition | AVA v2.2 | VideoMAE (K400 pretrain, ViT-H, 16x4) | mAP | 36.5 | #15 |
| Action Recognition | AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | mAP | 39.5 | #10 |
| Action Recognition | AVA v2.2 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | mAP | 39.3 | #11 |
| Action Recognition | AVA v2.2 | VideoMAE (K700 pretrain, ViT-L, 16x4) | mAP | 36.1 | #16 |
| Self-Supervised Action Recognition | HMDB51 | VideoMAE (no extra data) | Top-1 Accuracy | 62.6 | #22 |
| | | | Pre-Training Dataset | no extra data | #1 |
| | | | Frozen | false | #1 |
| Self-Supervised Action Recognition | HMDB51 | VideoMAE | Top-1 Accuracy | 73.3 | #5 |
| | | | Pre-Training Dataset | Kinetics400 | #1 |
| | | | Frozen | false | #1 |
| Action Classification | Kinetics-400 | VideoMAE (no extra data, ViT-L, 32x320x320) | Acc@1 | 86.1 | #40 |
| | | | Acc@5 | 97.3 | #26 |
| Action Classification | Kinetics-400 | VideoMAE (no extra data, ViT-L, 16x4) | Acc@1 | 85.2 | #46 |
| | | | Acc@5 | 96.8 | #35 |
| Action Classification | Kinetics-400 | VideoMAE (no extra data, ViT-H) | Acc@1 | 86.6 | #36 |
| | | | Acc@5 | 97.1 | #30 |
| Action Classification | Kinetics-400 | VideoMAE (no extra data, ViT-H, 32x320x320) | Acc@1 | 87.4 | #26 |
| | | | Acc@5 | 97.6 | #17 |
| Action Classification | Kinetics-400 | VideoMAE (no extra data, ViT-B, 16x4) | Acc@1 | 81.5 | #67 |
| | | | Acc@5 | 95.1 | #51 |
| Action Recognition | Something-Something V2 | VideoMAE (no extra data, ViT-L, 16frame) | Top-1 Accuracy | 74.3 | #12 |
| | | | Top-5 Accuracy | 94.6 | #8 |
| | | | Parameters | 305 | #16 |
| | | | GFLOPs | 597x6 | #6 |
| Action Recognition | Something-Something V2 | VideoMAE (no extra data, ViT-L, 32x2) | Top-1 Accuracy | 75.4 | #7 |
| | | | Top-5 Accuracy | 95.2 | #4 |
| | | | Parameters | 305 | #16 |
| | | | GFLOPs | 1436x3 | #6 |
| Action Recognition | Something-Something V2 | VideoMAE (no extra data, ViT-B, 16frame) | Top-1 Accuracy | 70.8 | #28 |
| | | | Top-5 Accuracy | 92.4 | #25 |
| | | | Parameters | 87 | #25 |
| | | | GFLOPs | 180x6 | #6 |
| Self-Supervised Action Recognition | UCF101 | VideoMAE | 3-fold Accuracy | 96.1 | #5 |
| | | | Pre-Training Dataset | Kinetics400 | #1 |
| | | | Frozen | false | #1 |
| Self-Supervised Action Recognition | UCF101 | VideoMAE (no extra data) | 3-fold Accuracy | 91.3 | #19 |
| | | | Pre-Training Dataset | no extra data | #1 |
| | | | Frozen | false | #1 |
