Masked Autoencoders Are Scalable Vision Learners

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
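The masking step described above (keep a random subset of patches, drop the rest, and remember where they came from so the decoder can reinsert mask tokens) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation (which is in PyTorch); the function name, shapes, and the argsort-of-noise shuffle are assumptions made for clarity.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random (1 - mask_ratio) subset of patch embeddings.

    patches: array of shape (N, L, D) -- batch, number of patches, embed dim.
    Returns:
      visible      -- (N, L_keep, D) patches the encoder would see,
      mask         -- (N, L) binary mask in original patch order (1 = masked),
      ids_restore  -- (N, L) indices to undo the shuffle when the decoder
                      reinserts mask tokens.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))

    noise = rng.random((N, L))                 # one random score per patch
    ids_shuffle = np.argsort(noise, axis=1)    # ascending: lowest scores are kept
    ids_restore = np.argsort(ids_shuffle, axis=1)

    ids_keep = ids_shuffle[:, :len_keep]
    visible = np.take_along_axis(patches, ids_keep[:, :, None], axis=1)

    # Build the binary mask in shuffled order, then map back to original order.
    mask = np.ones((N, L))
    mask[:, :len_keep] = 0
    mask = np.take_along_axis(mask, ids_restore, axis=1)
    return visible, mask, ids_restore
```

With the default 75% ratio, an encoder running on `visible` processes only a quarter of the patches, which is what makes the asymmetric design cheap to train.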

CVPR 2022
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | MAE (ViT-B, UperNet) | Validation mIoU | 48.1 | #142 |
| Semantic Segmentation | ADE20K | MAE (ViT-L, UperNet) | Validation mIoU | 53.6 | #72 |
| Object Detection | COCO minival | MAE (ViT-B, Mask R-CNN) | box AP | 50.3 | #75 |
| Object Detection | COCO minival | MAE (ViT-L, Mask R-CNN) | box AP | 53.3 | #59 |
| Image Classification | ImageNet | MAE (ViT-H, 448) | Top 1 Accuracy | 87.8% | #75 |
| Image Classification | ImageNet | MAE (ViT-H, 448) | Number of params | 656M | #939 |
| Image Classification | ImageNet | MAE (ViT-H) | Top 1 Accuracy | 86.9% | #115 |
| Image Classification | ImageNet | MAE (ViT-L) | Top 1 Accuracy | 85.9% | #182 |
| Image Classification | ImageNet | MAE (ViT-B) | Top 1 Accuracy | 83.6% | #373 |
| Self-Supervised Image Classification | ImageNet | MAE (ViT-B) | Top 1 Accuracy | 68.0% | #96 |
| Self-Supervised Image Classification | ImageNet | MAE (ViT-H) | Top 1 Accuracy | 76.6% | #51 |
| Self-Supervised Image Classification | ImageNet | MAE (ViT-L) | Top 1 Accuracy | 75.8% | #58 |
| Domain Generalization | ImageNet-A | MAE (ViT-H, 448) | Top-1 accuracy % | 76.7 | #6 |
| Domain Generalization | ImageNet-C | MAE (ViT-H) | mean Corruption Error (mCE) | 33.8 | #6 |
| Domain Generalization | ImageNet-C | MAE (ViT-H) | Number of params | 632M | #41 |
| Self-Supervised Image Classification | ImageNet (finetuned) | MAE (ViT-H/14) | Top 1 Accuracy | 86.9% | #11 |
| Self-Supervised Image Classification | ImageNet (finetuned) | MAE (ViT-H/14, 448) | Number of params | 632M | #7 |
| Self-Supervised Image Classification | ImageNet (finetuned) | MAE (ViT-H/14, 448) | Top 1 Accuracy | 87.8% | #7 |
| Domain Generalization | ImageNet-R | MAE (ViT-H, 448) | Top-1 Error Rate | 33.5 | #10 |
| Semantic Segmentation | ImageNet-S | MAE (ViT-B/16, 224x224, SSL, mmseg) | mIoU (val) | 40.0 | #17 |
| Semantic Segmentation | ImageNet-S | MAE (ViT-B/16, 224x224, SSL, mmseg) | mIoU (test) | 40.3 | #14 |
| Semantic Segmentation | ImageNet-S | MAE (ViT-B/16, 224x224, SSL+FT) | mIoU (val) | 61.0 | #5 |
| Semantic Segmentation | ImageNet-S | MAE (ViT-B/16, 224x224, SSL+FT) | mIoU (test) | 60.2 | #4 |
| Semantic Segmentation | ImageNet-S | MAE (ViT-B/16, 224x224, SSL+FT, mmseg) | mIoU (val) | 61.6 | #4 |
| Semantic Segmentation | ImageNet-S | MAE (ViT-B/16, 224x224, SSL+FT, mmseg) | mIoU (test) | 61.2 | #3 |
| Semantic Segmentation | ImageNet-S | MAE (ViT-B/16, 224x224, SSL) | mIoU (val) | 38.3 | #18 |
| Semantic Segmentation | ImageNet-S | MAE (ViT-B/16, 224x224, SSL) | mIoU (test) | 37.0 | #16 |
| Domain Generalization | ImageNet-Sketch | MAE (ViT-H, 448) | Top-1 accuracy | 50.9 | #10 |
| Out-of-Distribution Generalization | ImageNet-W | MAE (ViT-L/16, fine-tuning) | IN-W Gap | -4.4 | #1 |
| Out-of-Distribution Generalization | ImageNet-W | MAE (ViT-L/16, fine-tuning) | Carton Gap | +22 | #1 |
| Out-of-Distribution Generalization | ImageNet-W | MAE (ViT-H/14, fine-tuning) | IN-W Gap | -3.5 | #1 |
| Out-of-Distribution Generalization | ImageNet-W | MAE (ViT-H/14, fine-tuning) | Carton Gap | +30 | #1 |
| Out-of-Distribution Generalization | ImageNet-W | MAE (ViT-B/16, fine-tuning) | IN-W Gap | -4.6 | #1 |
| Out-of-Distribution Generalization | ImageNet-W | MAE (ViT-B/16, fine-tuning) | Carton Gap | +24 | #1 |
| Image Classification | iNaturalist | MAE (ViT-H, 448) | Top 1 Accuracy | 83.4 | #2 |
| Image Classification | iNaturalist 2018 | MAE (ViT-H, 448) | Top-1 Accuracy | 86.8% | #6 |
| Image Classification | iNaturalist 2019 | MAE (ViT-H, 448) | Top-1 Accuracy | 88.3 | #2 |
| Image Classification | OmniBenchmark | MAE | Average Top-1 Accuracy | 30.6 | #21 |
| Image Classification | Places205 | MAE (ViT-H, 448) | Top 1 Accuracy | 66.8 | #5 |
| Image Classification | Places365-Standard | MAE (ViT-H, 448) | Top 1 Accuracy | 60.3 | #3 |

Methods


CV-MIM • MAE