TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic Segmentation	ADE20K	A2MIM (ViT-B)	Validation mIoU	49	# 132
Semantic Segmentation	ADE20K	A2MIM (ResNet-50)	Validation mIoU	38.3	# 214
Instance Segmentation	COCO test-dev	A2MIM (ResNet-50 2x)	mask AP	34.9	# 96
Object Detection	COCO test-dev	A2MIM (ViT-B)	box mAP	49.4	# 90
Object Detection	COCO test-dev	A2MIM (ResNet-50 2x)	box mAP	39.8	# 193
Instance Segmentation	COCO test-dev	A2MIM (ViT-B)	mask AP	43.5	# 46
Self-Supervised Image Classification	ImageNet (finetuned)	A2MIM (ResNet-50 RSB-A2)	Top 1 Accuracy	80.4%	# 57
Self-Supervised Image Classification	ImageNet (finetuned)	A2MIM (ResNet-50 RSB-A3)	Top 1 Accuracy	78.8%	# 59
Self-Supervised Image Classification	ImageNet (finetuned)	A2MIM (ViT-S)	Top 1 Accuracy	82.2%	# 52
Self-Supervised Image Classification	ImageNet (finetuned)	A2MIM+ (ViT-S)	Top 1 Accuracy	82.4%	# 51
Self-Supervised Image Classification	ImageNet (finetuned)	A2MIM (ViT-B)	Top 1 Accuracy	84.2%	# 33
Self-Supervised Image Classification	ImageNet (finetuned)	A2MIM+ (ViT-B)	Top 1 Accuracy	84.5%	# 30
Self-Supervised Image Classification	ImageNet (finetuned)	A2MIM+ (ResNet-50 RSB-A2)	Top 1 Accuracy	80.5%	# 56
Self-Supervised Image Classification	ImageNet (finetuned)	A2MIM+ (ResNet-50 RSB-A3)	Top 1 Accuracy	78.9%	# 58

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/architecture-agnostic-masked-image-modeling/self-supervised-image-classification-on-1)](https://paperswithcode.com/sota/self-supervised-image-classification-on-1?p=architecture-agnostic-masked-image-modeling)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/architecture-agnostic-masked-image-modeling/instance-segmentation-on-coco)](https://paperswithcode.com/sota/instance-segmentation-on-coco?p=architecture-agnostic-masked-image-modeling)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/architecture-agnostic-masked-image-modeling/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=architecture-agnostic-masked-image-modeling)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/architecture-agnostic-masked-image-modeling/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=architecture-agnostic-masked-image-modeling)`

Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN

27 May 2022 · Siyuan Li, Di Wu, Fang Wu, Zelin Zang, Stan Z. Li

Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers. Its underlying idea is simple: a portion of the input image is masked out and then reconstructed via a pretext task. However, the working principle behind MIM is not well explained, and previous studies insist that MIM primarily works for the Transformer family but is incompatible with CNNs. In this work, we observe that MIM essentially teaches the model to learn better middle-order interactions among patches for more generalized feature extraction. We then propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A$^2$MIM learns better representations without explicit design and endows the backbone model with a stronger capability to transfer to various downstream tasks.
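
The mask-then-reconstruct idea described in the abstract can be illustrated with a small sketch. This is a generic MIM toy example in plain Python, not the A$^2$MIM method: the function names (`random_patch_mask`, `apply_mask`, `masked_l1_loss`), the zero fill value, the L1 loss, and the 50% mask ratio are all assumptions chosen for illustration and are not taken from the paper.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, seed=0):
    """Return a grid_h x grid_w grid of booleans; True marks a masked patch."""
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [[(r * grid_w + c) in masked_ids for c in range(grid_w)]
            for r in range(grid_h)]


def apply_mask(image, mask, patch, fill=0.0):
    """Replace every pixel that lies inside a masked patch with `fill`."""
    return [[fill if mask[y // patch][x // patch] else image[y][x]
             for x in range(len(image[0]))]
            for y in range(len(image))]


def masked_l1_loss(pred, target, mask, patch):
    """Mean absolute error computed over pixels in masked patches only."""
    total, count = 0.0, 0
    for y, row in enumerate(target):
        for x, value in enumerate(row):
            if mask[y // patch][x // patch]:
                total += abs(pred[y][x] - value)
                count += 1
    return total / count if count else 0.0


# Toy 8x8 single-channel "image" split into a 2x2 grid of 4x4 patches.
image = [[float(x + y) for x in range(8)] for y in range(8)]
mask = random_patch_mask(2, 2, mask_ratio=0.5)
corrupted = apply_mask(image, mask, patch=4)

# Stand-in for a model: the corrupted input itself is used as the
# "prediction", so the loss shows what an untrained identity map would score.
loss = masked_l1_loss(corrupted, image, mask, patch=4)
```

In an actual pre-training setup, a backbone (a Transformer or a CNN) would take `corrupted` as input and produce the prediction, and the loss on the masked region would drive the parameter updates.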

Code

Westlake-AI/openmixup official

570

Westlake-AI/A2MIM official

open-mmlab/mmpretrain

3,157

Tasks

Image Classification

Instance Segmentation

Object Detection

Self-Supervised Image Classification

Self-Supervised Learning

Semantic Segmentation

Datasets

ImageNet

MS COCO

ADE20K

Results from the Paper

Ranked #30 on Self-Supervised Image Classification on ImageNet (finetuned)

Methods

Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • MIM • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
