Masked Feature Prediction for Self-Supervised Visual Pre-Training

We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the features of the masked regions. We study five different types of features and find that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 39.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame, and obtains competitive results on ImageNet.
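The mask-then-predict objective described in the abstract can be illustrated with a minimal NumPy sketch for a single grayscale frame. This is not the authors' implementation. The function names (`hog_cells`, `random_patch_mask`, `masked_feature_loss`), the 8x8 cell size, the 9 orientation bins, the 40% mask ratio, the per-cell L2 normalization, and the mean-squared-error loss are all assumptions chosen for illustration; the abstract only states that HOG features of masked regions are predicted and that local contrast normalization matters.

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation (simplified HOG)."""
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, pi), quantized into `bins` bins.
    orientation = np.mod(np.arctan2(gy, gx), np.pi)
    bin_index = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)

    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            # Magnitude-weighted vote of each pixel into its orientation bin.
            hist[r, c] = np.bincount(bin_index[window].ravel(),
                                     weights=magnitude[window].ravel(),
                                     minlength=bins)
    # Assumed stand-in for local contrast normalization: L2-normalize each cell.
    norm = np.linalg.norm(hist, axis=-1, keepdims=True)
    return hist / (norm + 1e-6)

def random_patch_mask(rows, cols, ratio, rng):
    """Boolean (rows, cols) mask with a fixed fraction of cells set to True."""
    n = rows * cols
    n_masked = int(round(ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_masked, replace=False)] = True
    return mask.reshape(rows, cols)

def masked_feature_loss(predicted, target, mask):
    """Mean squared error computed over masked cells only (assumed loss)."""
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))

# Toy usage: a random "frame" and an all-zero placeholder for model output.
rng = np.random.default_rng(0)
frame = rng.random((32, 32))
target = hog_cells(frame)                      # shape (4, 4, 9)
mask = random_patch_mask(4, 4, ratio=0.4, rng=rng)
prediction = np.zeros_like(target)             # hypothetical model prediction
loss = masked_feature_loss(prediction, target, mask)
```

In a real pipeline the all-zero `prediction` would be replaced by the output of a Transformer that sees the frame with the masked patches hidden. Restricting the loss to masked cells is what makes the task a prediction problem rather than a reconstruction of visible content.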

Venue: CVPR 2022

Results from the Paper


Ranked #8 on Action Recognition on AVA v2.2 (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | AVA v2.2 | MaskFeat (Kinetics-600 pretrain, MViT-L) | mAP | 39.8 | #8 |
| Self-Supervised Image Classification | ImageNet (finetuned) | MaskFeat (ViT-L) | Number of Params | 307M | #13 |
| | | | Top 1 Accuracy | 85.7% | #21 |
| Action Classification | Kinetics-400 | MaskFeat (no extra data, MViT-L) | Acc@1 | 86.7 | #39 |
| | | | Acc@5 | 97.3 | #27 |
| Action Classification | Kinetics-400 | MaskFeat (K600, MViT-L) | Acc@1 | 87.0 | #36 |
| | | | Acc@5 | 97.4 | #24 |
| Action Classification | Kinetics-600 | MaskFeat (no extra data, MViT-L) | Top-1 Accuracy | 88.3 | #20 |
| | | | Top-5 Accuracy | 98.0 | #9 |
| Action Classification | Kinetics-700 | MaskFeat (no extra data, MViT-L) | Top-1 Accuracy | 80.4 | #12 |
| | | | Top-5 Accuracy | 95.7 | #5 |
| Action Recognition | Something-Something V2 | MaskFeat (Kinetics600 pretrain, MViT-L) | Top-1 Accuracy | 75.0 | #11 |
| | | | Top-5 Accuracy | 95.0 | #6 |
| | | | Parameters | 218 | #19 |
| | | | GFLOPs | 2828*3 | #6 |

Methods