PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perceptual judgment. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. Surprisingly, we find a simple yet effective idea: enforcing perceptual similarity during dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for calculating perceptual similarity. We demonstrate that such learned visual tokens indeed exhibit better semantic meaning and help pre-training achieve superior transfer performance on various downstream tasks. For example, we achieve $\textbf{84.5\%}$ Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by $\textbf{+1.3\%}$ under the same number of pre-training epochs. Our approach also yields significant improvements on object detection and segmentation on COCO and on semantic segmentation on ADE20K. Equipped with a larger ViT-H backbone, we achieve the state-of-the-art ImageNet accuracy (\textbf{88.3\%}) among methods using only ImageNet-1K data.
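The core idea described above, adding a perceptual similarity term to the dVAE reconstruction objective computed in the feature space of a frozen self-supervised vision transformer, can be illustrated with a minimal sketch. The `dvae` and `feature_extractor` callables and the `lambda_percep` weight below are hypothetical placeholders, and the sketch omits the other dVAE training terms (e.g., the codebook/latent loss); it is not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def perceptual_dvae_loss(dvae, feature_extractor, images, lambda_percep=1.0):
    """Sketch of a dVAE training objective with an added perceptual term.

    `dvae` is assumed to return a reconstruction of `images`;
    `feature_extractor` is assumed to be a frozen self-supervised ViT whose
    deep features are used to measure perceptual similarity.
    """
    recon = dvae(images)  # reconstructed images, same shape as `images`

    # Standard pixel-level reconstruction loss.
    pixel_loss = F.mse_loss(recon, images)

    # Perceptual loss: match deep features of the reconstruction and the
    # original, so perceptually similar images yield similar prediction targets.
    with torch.no_grad():
        target_feats = feature_extractor(images)   # no gradient for the target
    recon_feats = feature_extractor(recon)         # gradient flows back to the dVAE
    percep_loss = F.mse_loss(recon_feats, target_feats)

    return pixel_loss + lambda_percep * percep_loss
```

A practical variant would compare features from several transformer layers rather than a single output, but the weighting of the perceptual term against the pixel and codebook losses is left unspecified here.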

Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Image Classification | ImageNet | PeCo (ViT-H, 448) | Top 1 Accuracy | 88.3% | #62
Image Classification | ImageNet | PeCo (ViT-H, 448) | Number of params | 656M | #946
Image Classification | ImageNet | PeCo (ViT-H, 224) | Top 1 Accuracy | 87.5% | #86
Self-Supervised Image Classification | ImageNet (finetuned) | PeCo (ViT-H/14, 448) | Number of Params | 632M | #7
Self-Supervised Image Classification | ImageNet (finetuned) | PeCo (ViT-H/14, 448) | Top 1 Accuracy | 88.3% | #4
