Context Autoencoder for Self-Supervised Representation Learning

We present a novel masked image modeling (MIM) approach, the context autoencoder (CAE), for self-supervised representation pretraining. The goal is to pretrain an encoder by solving a pretext task: estimating the masked patches from the visible patches in an image. Our approach first feeds the visible patches into the encoder, extracting their representations. It then predicts the representations of the masked patches from those of the visible patches, operating in the encoded representation space. We introduce an alignment constraint encouraging the representations predicted for the masked patches to agree with the representations of those same patches computed by the encoder. In other words, the predicted representations are expected to lie in the encoded representation space, which we empirically find benefits representation learning. Finally, a decoder maps the predicted masked-patch representations to the targets of the pretext task. In contrast to previous MIM methods (e.g., BEiT) that couple the encoding and pretext-task completion roles, our approach benefits from separating the representation learning (encoding) role from the pretext-task completion role, improving the representation learning capacity and in turn the gains on downstream tasks. In addition, we offer explanations for why contrastive pretraining and supervised pretraining perform similarly and why MIM can potentially perform better. We demonstrate the effectiveness of CAE through superior transfer performance on downstream tasks: semantic segmentation, object detection, and instance segmentation.
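The pipeline described above (encode visible patches, regress masked-patch representations under an alignment constraint, then decode to the pretext targets) can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the module sizes, the self-attention regressor over concatenated tokens, the MSE alignment and reconstruction losses, and the stop-gradient on the alignment targets are all simplifying assumptions.

```python
import torch
import torch.nn as nn


class ContextAutoencoderSketch(nn.Module):
    """Illustrative CAE-style pretraining flow (assumed, simplified modules)."""

    def __init__(self, dim=64, heads=4, depth=2):
        super().__init__()
        # Encoder: sees visible patches only.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True), depth)
        # Latent regressor: predicts masked-patch representations from visible ones.
        self.regressor = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True), depth)
        # Decoder: maps predicted representations to the pretext-task targets
        # (here simply the raw patch values, an assumption for illustration).
        self.decoder = nn.Linear(dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, patches, mask):
        # patches: (B, N, D); mask: (B, N) bool, True = masked.
        # Assumes every sample masks the same number of patches.
        B, N, D = patches.shape
        z_vis = self.encoder(patches[~mask].view(B, -1, D))
        # Alignment targets: encoder applied to the masked patches, no gradient.
        with torch.no_grad():
            z_target = self.encoder(patches[mask].view(B, -1, D))
        # Predict masked representations from visible ones via mask queries.
        queries = self.mask_token.expand(B, z_target.shape[1], D)
        z_pred = self.regressor(torch.cat([z_vis, queries], dim=1))[:, z_vis.shape[1]:]
        # Alignment constraint: predictions should lie in the encoded space.
        align_loss = nn.functional.mse_loss(z_pred, z_target)
        # Pretext task: decode predicted representations to the targets.
        recon = self.decoder(z_pred)
        recon_loss = nn.functional.mse_loss(recon, patches[mask].view(B, -1, D))
        return recon_loss + align_loss
```

Note the separation of roles the abstract emphasizes: only `self.encoder` does representation learning, while the regressor and decoder absorb the pretext-task completion, so the encoder can be transferred alone to downstream tasks.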

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | CAE (ViT-L, UperNet) | Validation mIoU | 54.7 | #22 |
| Object Detection | COCO minival | CAE (ViT-L, Mask R-CNN, 1x schedule) | box AP | 54.5 | #23 |
| Self-Supervised Image Classification | ImageNet | CAE (ViT-L/16, Attentive) | Top 1 Accuracy | 81.2% | #5 |
| Self-Supervised Image Classification | ImageNet | CAE (ViT-L/16, Attentive) | Number of Params | 307M | #13 |
| Self-Supervised Image Classification | ImageNet | CAE (ViT-L/16) | Top 1 Accuracy | 78.1% | #26 |
| Self-Supervised Image Classification | ImageNet | CAE (ViT-L/16) | Number of Params | 307M | #13 |
| Self-Supervised Image Classification | ImageNet | CAE (ViT-B/16) | Top 1 Accuracy | 77.1% | #34 |
| Self-Supervised Image Classification | ImageNet | CAE (ViT-B/16) | Number of Params | 86M | #31 |
| Self-Supervised Image Classification | ImageNet (finetuned) | CAE (ViT-L/16) | Top 1 Accuracy | 86.3% | #6 |
| Self-Supervised Image Classification | ImageNet (finetuned) | CAE (ViT-L/16) | Number of Params | 307M | #7 |