TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Self-Supervised Image Classification	ImageNet	EsViT (Swin-B)	Top 1 Accuracy	81.3%	# 16
Self-Supervised Image Classification	ImageNet	EsViT (Swin-B)	Top 5 Accuracy	95.5%	# 1
Self-Supervised Image Classification	ImageNet	EsViT (Swin-B)	Number of Params	87M	# 34
Self-Supervised Image Classification	ImageNet	EsViT (Swin-S)	Top 1 Accuracy	80.8%	# 20
Self-Supervised Image Classification	ImageNet	EsViT (Swin-S)	Number of Params	49M	# 44
Self-Supervised Image Classification	ImageNet (finetuned)	EsViT (Swin-B)	Number of Params	87M	# 34
Self-Supervised Image Classification	ImageNet (finetuned)	EsViT (Swin-B)	Top 1 Accuracy	83.9%	# 40

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/efficient-self-supervised-vision-transformers/self-supervised-image-classification-on)](https://paperswithcode.com/sota/self-supervised-image-classification-on?p=efficient-self-supervised-vision-transformers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/efficient-self-supervised-vision-transformers/self-supervised-image-classification-on-1)](https://paperswithcode.com/sota/self-supervised-image-classification-on-1?p=efficient-self-supervised-vision-transformers)`

Efficient Self-supervised Vision Transformers for Representation Learning

ICLR 2022 · Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao

This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attention can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task, region matching, which allows the model to capture fine-grained region dependencies and, as a result, significantly improves the quality of the learned vision representations. Our results show that, by combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models are publicly available: https://github.com/microsoft/esvit
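The abstract describes region matching only at a high level. As a minimal sketch, assume a student/teacher setup in which each region of one augmented view is paired with its most similar region (by cosine similarity) in another view, and the student's per-region distribution is pulled toward the teacher's distribution at the matched region. The function names, the temperatures, and the toy data below are illustrative assumptions and are not taken from the microsoft/esvit code.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(query, candidates):
    """Index of the candidate region most similar to the query region."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def region_matching_loss(student_feats, teacher_feats,
                         student_logits, teacher_logits,
                         student_temp=0.1, teacher_temp=0.04):
    """Mean cross-entropy between each student region and its matched teacher region.

    The temperatures are assumed defaults chosen for illustration only.
    """
    total = 0.0
    for z_s, logit_s in zip(student_feats, student_logits):
        j_star = best_match(z_s, teacher_feats)          # matched region in the other view
        log_p_t = log_softmax(teacher_logits[j_star], teacher_temp)
        log_p_s = log_softmax(logit_s, student_temp)
        total -= sum(math.exp(lt) * ls for lt, ls in zip(log_p_t, log_p_s))
    return total / len(student_feats)


# Toy data: two regions per view; the second view lists the same regions in swapped order.
student_feats = [[1.0, 0.0], [0.0, 1.0]]
teacher_feats = [[0.0, 1.0], [1.0, 0.0]]
student_logits = [[2.0, 0.0], [0.0, 2.0]]
teacher_logits = [[0.0, 2.0], [2.0, 0.0]]

loss = region_matching_loss(student_feats, teacher_feats, student_logits, teacher_logits)
```

Because matching is done by feature similarity rather than by position, the loss stays small here even though the two views list their regions in a different order.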

Code

microsoft/esvit official

403

Tasks

Representation Learning

Self-Supervised Image Classification

Datasets

ImageNet

Results from the Paper

Ranked #16 on Self-Supervised Image Classification on ImageNet
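The Top 1 and Top 5 Accuracy metrics reported for this paper count a prediction as correct when the true label is among the k highest-scoring classes. A minimal sketch of that computation follows; the function name and the toy scores are illustrative assumptions, not part of the paper's evaluation code.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and of equal length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example: 3 samples, 4 classes (hypothetical classifier scores).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1 is ranked first
    [0.5, 0.1, 0.3, 0.1],  # true class 2 is ranked second
    [0.3, 0.1, 0.2, 0.4],  # true class 1 is ranked last
]
labels = [1, 2, 1]

top1 = top_k_accuracy(scores, labels, k=1)  # 1 of 3 samples correct
top2 = top_k_accuracy(scores, labels, k=2)  # 2 of 3 samples correct
```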

Methods

EsViT
