Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning

Benefiting from masked visual modeling, self-supervised video representation learning has achieved remarkable progress. However, existing methods focus on learning representations from scratch by reconstructing low-level features such as raw pixel RGB values. In this paper, we propose masked video distillation (MVD), a simple yet effective two-stage masked feature modeling framework for video representation learning: first, we pretrain an image (or video) model by recovering low-level features of masked patches; then, we use the resulting features as targets for masked feature modeling. For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks. Visualization analysis also indicates that different teachers produce different learned patterns in students. Motivated by this observation, we design a spatial-temporal co-teaching method for MVD. Specifically, we distill student models from both video teachers and image teachers through masked feature modeling. Extensive experimental results demonstrate that video transformers pretrained with spatial-temporal co-teaching outperform models distilled from a single teacher on a multitude of video datasets. Our MVD with a vanilla ViT achieves state-of-the-art performance compared with previous supervised and self-supervised methods on several challenging video downstream tasks. For example, with the ViT-Large model, MVD achieves 86.4% and 76.7% Top-1 accuracy on Kinetics-400 and Something-Something-v2, outperforming VideoMAE by 1.2% and 2.4% respectively. When the larger ViT-Huge model is adopted, MVD achieves state-of-the-art performance with 77.3% Top-1 accuracy on Something-Something-v2 and 41.1 mAP on AVA v2.2. Code will be available at \url{https://github.com/ruiwang2021/mvd}.
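
The sketch below illustrates the spatial-temporal co-teaching objective described in the abstract: the student's features at masked positions are regressed toward the features of a frozen image teacher and a frozen video teacher, and the two losses are summed. It is a minimal illustration, not the official implementation (see the repository above); the names `co_teaching_loss`, `img_head`, `vid_head`, the smooth-L1 loss, and the lambda weights are assumptions made for the example.

```python
# Minimal sketch of MVD-style spatial-temporal co-teaching (assumed names/loss).
import torch
import torch.nn as nn
import torch.nn.functional as F


def co_teaching_loss(student_latent, img_head, vid_head,
                     img_teacher_feat, vid_teacher_feat,
                     lambda_img=1.0, lambda_vid=1.0):
    """Regress image- and video-teacher features at masked token positions.

    student_latent:   (B, N_mask, D)     student features of masked tokens
    img_teacher_feat: (B, N_mask, D_img) frozen image-teacher targets
    vid_teacher_feat: (B, N_mask, D_vid) frozen video-teacher targets
    """
    # Each head projects student features into the corresponding teacher's space.
    loss_img = F.smooth_l1_loss(img_head(student_latent), img_teacher_feat)
    loss_vid = F.smooth_l1_loss(vid_head(student_latent), vid_teacher_feat)
    # Weighted sum of the two feature-regression losses.
    return lambda_img * loss_img + lambda_vid * loss_vid


if __name__ == "__main__":
    # Toy shapes for illustration only; a linear layer stands in for the
    # shallow decoder that would normally follow the student encoder.
    B, N_mask, D, D_img, D_vid = 2, 128, 384, 768, 768
    img_head = nn.Linear(D, D_img)
    vid_head = nn.Linear(D, D_vid)
    loss = co_teaching_loss(torch.randn(B, N_mask, D), img_head, vid_head,
                            torch.randn(B, N_mask, D_img),
                            torch.randn(B, N_mask, D_vid))
    loss.backward()
    print(float(loss))
```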

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | AVA v2.2 | MVD (Kinetics400 pretrain, ViT-B, 16x4) | mAP | 31.1 | #25 |
| Action Recognition | AVA v2.2 | MVD (Kinetics400 pretrain, ViT-H, 16x4) | mAP | 40.1 | #7 |
| Action Recognition | AVA v2.2 | MVD (Kinetics400 pretrain+finetune, ViT-H, 16x4) | mAP | 41.1 | #5 |
| Action Recognition | AVA v2.2 | MVD (Kinetics400 pretrain+finetune, ViT-L, 16x4) | mAP | 38.7 | #12 |
| Action Recognition | AVA v2.2 | MVD (Kinetics400 pretrain, ViT-L, 16x4) | mAP | 37.7 | #14 |
| Action Recognition | AVA v2.2 | MVD (Kinetics400 pretrain+finetune, ViT-B, 16x4) | mAP | 34.2 | #20 |
| Self-Supervised Action Recognition | HMDB51 | MVD (ViT-B) | Top-1 Accuracy | 79.7 | #1 |
| Self-Supervised Action Recognition | HMDB51 | MVD (ViT-B) | Pre-Training Dataset | Kinetics400 | #1 |
| Self-Supervised Action Recognition | HMDB51 | MVD (ViT-B) | Frozen | false | #1 |
| Action Classification | Kinetics-400 | MVD (K400 pretrain, ViT-L, 16x224x224) | Acc@1 | 86.4 | #43 |
| Action Classification | Kinetics-400 | MVD (K400 pretrain, ViT-L, 16x224x224) | Acc@5 | 97.0 | #34 |
| Action Classification | Kinetics-400 | MVD (K400 pretrain, ViT-H, 16x224x224) | Acc@1 | 87.2 | #33 |
| Action Classification | Kinetics-400 | MVD (K400 pretrain, ViT-H, 16x224x224) | Acc@5 | 97.4 | #24 |
| Action Classification | Kinetics-400 | MVD (K400 pretrain, ViT-B, 16x224x224) | Acc@1 | 83.4 | #62 |
| Action Classification | Kinetics-400 | MVD (K400 pretrain, ViT-B, 16x224x224) | Acc@5 | 95.8 | #43 |
| Action Classification | Kinetics-400 | MVD (K400 pretrain, ViT-S, 16x224x224) | Acc@1 | 81.0 | #84 |
| Action Classification | Kinetics-400 | MVD (K400 pretrain, ViT-S, 16x224x224) | Acc@5 | 94.8 | #58 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | Top-1 Accuracy | 76.7 | #5 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | Top-5 Accuracy | 95.5 | #3 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | Parameters (M) | 305 | #17 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | GFLOPs | 597x6 | #7 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | Top-1 Accuracy | 77.3 | #1 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | Top-5 Accuracy | 95.7 | #2 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | Parameters (M) | 633 | #14 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | GFLOPs | 1192x6 | #7 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | Top-1 Accuracy | 70.9 | #33 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | Top-5 Accuracy | 92.8 | #21 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | Parameters (M) | 22 | #35 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | GFLOPs | 57x6 | #7 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | Top-1 Accuracy | 73.7 | #16 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | Top-5 Accuracy | 94.0 | #13 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | Parameters (M) | 87 | #26 |
| Action Recognition | Something-Something V2 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | GFLOPs | 180x6 | #7 |
| Self-Supervised Action Recognition | UCF101 | MVD (ViT-B) | 3-fold Accuracy | 97.5 | #2 |
| Self-Supervised Action Recognition | UCF101 | MVD (ViT-B) | Pre-Training Dataset | Kinetics400 | #1 |
| Self-Supervised Action Recognition | UCF101 | MVD (ViT-B) | Frozen | false | #1 |
