Masked Image Residual Learning for Scaling Deeper Vision Transformers

NeurIPS 2023 · Guoxi Huang, Hongtao Fu, Adrian G. Bors

Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in the deeper layers of ViT when using masked image modeling (MIM) for pre-training. To ease the training of deeper ViTs, we introduce a self-supervised learning framework called Masked Image Residual Learning (MIRL), which significantly alleviates the degradation problem, making scaling ViT along depth a promising direction for improving performance. We reformulate the pre-training objective for the deeper layers of ViT as learning to recover the residual of the masked image. We provide extensive empirical evidence showing that deeper ViTs can be effectively optimized using MIRL and readily gain accuracy from increased depth. With the same level of computational complexity as ViT-Base and ViT-Large, we instantiate 4.5$\times$ and 2$\times$ deeper ViTs, dubbed ViT-S-54 and ViT-B-48. The deeper ViT-S-54, with 3$\times$ less computational cost than ViT-Large, achieves performance on par with ViT-Large. ViT-B-48 achieves 86.2% top-1 accuracy on ImageNet. On the one hand, deeper ViTs pre-trained with MIRL exhibit excellent generalization on downstream tasks such as object detection and semantic segmentation. On the other hand, MIRL demonstrates high pre-training efficiency: with less pre-training time, it yields performance competitive with other approaches.
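
The abstract does not spell out the architectural details of MIRL; as a rough illustration of the residual objective it describes, below is a minimal PyTorch sketch in which a shallower branch reconstructs the masked patches directly and a deeper branch is supervised on the remaining residual. All names here (ToyMIRLObjective, shallow_decoder, deep_decoder) are hypothetical and simplified for illustration; the actual segmenting and decoding scheme follows the paper, not this sketch.

```python
# Hedged sketch of a masked-image-residual objective (hypothetical names,
# simplified two-branch form; not the paper's exact implementation).
import torch
import torch.nn as nn

class ToyMIRLObjective(nn.Module):
    """Shallow branch reconstructs masked patches; deep branch predicts
    the residual left over by the shallow reconstruction."""

    def __init__(self, embed_dim=256, patch_dim=16 * 16 * 3):
        super().__init__()
        # Stand-ins for the shallower and deeper halves of a ViT encoder.
        layer = lambda: nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.shallow_blocks = nn.TransformerEncoder(layer(), num_layers=2)
        self.deep_blocks = nn.TransformerEncoder(layer(), num_layers=2)
        # Lightweight decoders mapping tokens back to pixel patches.
        self.shallow_decoder = nn.Linear(embed_dim, patch_dim)
        self.deep_decoder = nn.Linear(embed_dim, patch_dim)

    def forward(self, tokens, target_patches, mask):
        """tokens: (B, N, D) token embeddings; target_patches: (B, N, P)
        ground-truth pixels; mask: (B, N) bool, True where patches were masked.
        (For simplicity, masked positions are kept in the token sequence.)"""
        h_shallow = self.shallow_blocks(tokens)
        h_deep = self.deep_blocks(h_shallow)

        main_pred = self.shallow_decoder(h_shallow)       # main image component
        residual_pred = self.deep_decoder(h_deep)         # residual component

        # Deeper layers are supervised on the residual of the masked image,
        # i.e. what the shallower reconstruction failed to recover.
        residual_target = (target_patches - main_pred).detach()

        m = mask.unsqueeze(-1).float()
        denom = m.sum() * target_patches.shape[-1]
        loss_main = ((main_pred - target_patches) ** 2 * m).sum() / denom
        loss_residual = ((residual_pred - residual_target) ** 2 * m).sum() / denom
        return loss_main + loss_residual

# Usage with random tensors (B=2 images, N=196 patches, D=256, P=768):
obj = ToyMIRLObjective()
loss = obj(torch.randn(2, 196, 256), torch.randn(2, 196, 768),
           torch.rand(2, 196) > 0.25)
```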

Task                                  Dataset               Model            Metric            Value   Global Rank
Image Classification                  ImageNet              MIRL (ViT-S-54)  Top 1 Accuracy    84.8%   #270
Image Classification                  ImageNet              MIRL (ViT-S-54)  Number of params  96M     #856
Image Classification                  ImageNet              MIRL (ViT-S-54)  GFLOPs            18.8    #361
Image Classification                  ImageNet              MIRL (ViT-B-48)  Top 1 Accuracy    86.2%   #164
Image Classification                  ImageNet              MIRL (ViT-B-48)  Number of params  341M    #923
Image Classification                  ImageNet              MIRL (ViT-B-48)  GFLOPs            67.0    #438
Self-Supervised Image Classification  ImageNet (finetuned)  MIRL (ViT-B-48)  Number of params  341M    #12
Self-Supervised Image Classification  ImageNet (finetuned)  MIRL (ViT-B-48)  Top 1 Accuracy    86.2%   #16
Self-Supervised Image Classification  ImageNet (finetuned)  MIRL (ViT-S-54)  Number of params  96M     #31
Self-Supervised Image Classification  ImageNet (finetuned)  MIRL (ViT-S-54)  Top 1 Accuracy    84.8%   #26
