TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Self-Supervised Image Classification	ImageNet (finetuned)	SimMIM (SwinV2-H, 512)	Number of Params	658M	# 6
Self-Supervised Image Classification	ImageNet (finetuned)	SimMIM (SwinV2-H, 512)	Top 1 Accuracy	87.1%	# 10
Self-Supervised Image Classification	ImageNet (finetuned)	SimMIM (ViT-B/16)	Number of Params	85M	# 39
Self-Supervised Image Classification	ImageNet (finetuned)	SimMIM (ViT-B/16)	Top 1 Accuracy	83.8%	# 42
Self-Supervised Image Classification	ImageNet (finetuned)	SimMIM (Swin-B)	Number of Params	88M	# 33
Self-Supervised Image Classification	ImageNet (finetuned)	SimMIM (Swin-B)	Top 1 Accuracy	84.0%	# 38
Self-Supervised Image Classification	ImageNet (finetuned)	SimMIM (Swin-L)	Number of Params	197M	# 27
Self-Supervised Image Classification	ImageNet (finetuned)	SimMIM (Swin-L)	Top 1 Accuracy	85.4%	# 23

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/simmim-a-simple-framework-for-masked-image/self-supervised-image-classification-on-1)](https://paperswithcode.com/sota/self-supervised-image-classification-on-1?p=simmim-a-simple-framework-for-masked-image)`

SimMIM: A Simple Framework for Masked Image Modeling

CVPR 2022 · Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, Han Hu ·

This paper presents SimMIM, a simple framework for masked image modeling. We simplify recently proposed approaches by dropping special designs such as block-wise masking and tokenization via discrete VAE or clustering. To understand what lets the masked image modeling task learn good representations, we systematically study the major components of our framework and find that a simple design for each component yields very strong representation learning performance: 1) random masking of the input image with a moderately large masked patch size (e.g., 32) makes a strong pretext task; 2) predicting raw RGB pixel values by direct regression performs no worse than patch classification approaches with complex designs; 3) the prediction head can be as light as a linear layer, with no worse performance than heavier ones. Using ViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K when pre-trained on the same dataset, surpassing the previous best approach by +0.6%. When applied to a larger model of about 650 million parameters, SwinV2-H, it achieves 87.1% top-1 accuracy on ImageNet-1K using only ImageNet-1K data. We also use this approach to facilitate the training of a 3B model (SwinV2-G), which, with $40\times$ less data than in previous practice, achieves the state of the art on four representative vision benchmarks. The code and models will be publicly available at https://github.com/microsoft/SimMIM.
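The first two design choices in the abstract (random masking with a large patch size, and direct regression of raw pixels) can be illustrated with a minimal NumPy sketch. This is not the official implementation: the function names `random_patch_mask` and `masked_l1_loss` are hypothetical, the 192-pixel image size and 0.6 mask ratio are assumed values chosen for illustration, and the L1 form of the regression loss is an assumption, since the abstract only says "direct regression".

```python
import numpy as np


def random_patch_mask(img_size, mask_patch_size, mask_ratio, rng):
    """Return a boolean (H, W) pixel mask; True marks masked pixels."""
    grid = img_size // mask_patch_size           # patches per side
    num_patches = grid * grid
    num_masked = int(round(mask_ratio * num_patches))
    flat = np.zeros(num_patches, dtype=bool)
    # Pick the masked patches uniformly at random (no block-wise structure).
    flat[rng.permutation(num_patches)[:num_masked]] = True
    patch_mask = flat.reshape(grid, grid)
    # Expand the patch-level mask to pixel resolution.
    return np.repeat(
        np.repeat(patch_mask, mask_patch_size, axis=0), mask_patch_size, axis=1
    )


def masked_l1_loss(pred, target, pixel_mask):
    """Mean absolute error over masked pixels only; pred/target are (H, W, C)."""
    m = pixel_mask[..., None]                    # broadcast over channels
    abs_err = np.abs(pred - target) * m
    return float(abs_err.sum() / (m.sum() * pred.shape[-1]))


# Assumed illustrative settings: 192x192 input, patch size 32, 60% masked.
rng = np.random.default_rng(0)
mask = random_patch_mask(192, 32, 0.6, rng)
target = rng.random((192, 192, 3))               # stand-in for an RGB image
pred = np.zeros_like(target)                     # stand-in for model output
loss = masked_l1_loss(pred, target, mask)
```

In a real pipeline `pred` would come from an encoder followed by a prediction head; per the abstract, that head can be as light as a single linear layer.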

Code

microsoft/simmim (official)	869
Westlake-AI/openmixup	570
impiga/plain-detr	182
Hazqeel09/ellzaf_ml

Tasks

Representation Learning

Self-Supervised Image Classification

Datasets

ImageNet

MS COCO

Kinetics

Kinetics 400

iNaturalist

JFT-3B

Results from the Paper

Ranked #10 on Self-Supervised Image Classification on ImageNet (finetuned)

Methods

VAE
