TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic Segmentation	ADE20K	BEiT-L (ViT+UperNet)	Validation mIoU	57.0	# 29
Semantic Segmentation	ADE20K val	BEiT-L (ViT+UperNet, ImageNet-22k pretrain)	mIoU	57.0	# 20
Image Classification	ImageNet	BEiT-L (ViT; ImageNet 1k pretrain)	Top 1 Accuracy	86.3%	# 153
Image Classification	ImageNet	BEiT-L (ViT; ImageNet 1k pretrain)	Number of params	86M	# 814
Image Classification	ImageNet	BEiT-L (ViT; ImageNet-22K pretrain)	Top 1 Accuracy	88.60%	# 44
Image Classification	ImageNet	BEiT-L (ViT; ImageNet-22K pretrain)	Number of params	331M	# 919
Self-Supervised Image Classification	ImageNet (finetuned)	BEiT-L (ViT)	Number of Params	307M	# 13
Self-Supervised Image Classification	ImageNet (finetuned)	BEiT-L (ViT)	Top 1 Accuracy	86.3%	# 14
Self-Supervised Image Classification	ImageNet (finetuned)	BEiT-B (ViT)	Number of Params	86M	# 36
Self-Supervised Image Classification	ImageNet (finetuned)	BEiT-B (ViT)	Top 1 Accuracy	84.6%	# 29
Image Classification	OmniBenchmark	BEiT	Average Top-1 Accuracy	30.1	# 22
Document Layout Analysis	PubLayNet val	BEiT-B	Text	0.934	# 8
Document Layout Analysis	PubLayNet val	BEiT-B	Title	0.866	# 9
Document Layout Analysis	PubLayNet val	BEiT-B	List	0.924	# 9
Document Layout Analysis	PubLayNet val	BEiT-B	Table	0.973	# 9
Document Layout Analysis	PubLayNet val	BEiT-B	Figure	0.957	# 9
Document Layout Analysis	PubLayNet val	BEiT-B	Overall	0.931	# 10
Document Image Classification	RVL-CDIP	BEiT-B	Accuracy	91.09%	# 26
Document Image Classification	RVL-CDIP	BEiT-B	Parameters	87M	# 15

BEiT: BERT Pre-Training of Image Transformers

ICLR 2022 · Hangbo Bao, Li Dong, Songhao Piao, Furu Wei

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.
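The masked image modeling steps described in the abstract (split into patches, tokenize, mask, predict the original tokens) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: `toy_tokenize` is a hypothetical stand-in for the learned image tokenizer, the uniform logits are a placeholder for the backbone Transformer, and the 64x64 image size, vocabulary size of 8, and mask ratio of 0.4 are assumptions chosen only to keep the example small. Masking here is uniform over patch positions, which is a simplification.

```python
import math
import random

PATCH = 16  # patch side in pixels, following the abstract's "16x16 pixels" example


def split_into_patches(image, patch=PATCH):
    """Split an H x W grayscale image (list of rows) into flat patches, row-major."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def toy_tokenize(patches, vocab_size):
    """Hypothetical tokenizer: bucket each patch's mean intensity into a discrete id.
    Assumes pixel values in [0, 1). A real system would use a learned tokenizer."""
    tokens = []
    for p in patches:
        mean = sum(p) / len(p)
        tokens.append(min(int(mean * vocab_size), vocab_size - 1))
    return tokens


def random_mask(num_patches, ratio, rng):
    """Pick round(ratio * num_patches) patch positions uniformly at random."""
    count = round(ratio * num_patches)
    return sorted(rng.sample(range(num_patches), count))


def masked_token_loss(logits, tokens, masked):
    """Mean cross-entropy of the true visual tokens, over masked positions only."""
    total = 0.0
    for i in masked:
        row = logits[i]
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[tokens[i]]
    return total / len(masked)


rng = random.Random(0)
image = [[rng.random() for _ in range(64)] for _ in range(64)]  # toy 64x64 image
patches = split_into_patches(image)            # 4 x 4 grid of 16x16 patches
tokens = toy_tokenize(patches, vocab_size=8)   # targets come from the clean image
masked = random_mask(len(patches), ratio=0.4, rng=rng)

# Corrupt the input: masked patches are replaced by a placeholder (None here).
corrupted = [None if i in masked else p for i, p in enumerate(patches)]

# Placeholder for the backbone Transformer: uniform logits over the vocabulary.
logits = [[0.0] * 8 for _ in patches]
loss = masked_token_loss(logits, tokens, masked)
```

With uniform logits the loss equals the log of the vocabulary size, which is the baseline a trained encoder would need to beat. Note that the prediction targets are computed from the uncorrupted image, while the encoder only sees `corrupted`.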

Code

Repository	Stars
microsoft/unilm (official)	18,280
huggingface/transformers	124,593
rwightman/pytorch-image-models	29,680
facebookresearch/vissl	3,227
pengzhiliang/MAE-pytorch	2,521

11 implementations are listed in total.

Tasks

Document Image Classification

Document Layout Analysis

Image Classification

Self-Supervised Image Classification

Semantic Segmentation

Datasets

ImageNet

ADE20K

ImageNet-1K

PubLayNet

RVL-CDIP

OmniBenchmark

Results from the Paper

Ranked #10 on Document Layout Analysis on PubLayNet val

Methods

Absolute Position Encodings • Adam • Attention Dropout • BERT • BPE • DeiT • Dense Connections • Dropout • Feedforward Network • GELU • Label Smoothing • Layer Normalization • Linear Layer • Linear Warmup With Linear Decay • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Weight Decay • WordPiece