PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

This paper explores a better codebook for BERT pre-training of vision transformers. The recent work BEiT successfully transfers BERT pre-training from NLP to vision. BEiT directly adopts a simple discrete VAE (dVAE) as the visual tokenizer, but does not consider the semantic level of the resulting visual tokens. By contrast, the discrete tokens in NLP are naturally highly semantic. This difference motivates us to learn a perceptual codebook. We find, somewhat surprisingly, that one simple idea is effective: enforcing perceptual similarity during dVAE training. We demonstrate that the visual tokens generated by the proposed perceptual codebook exhibit better semantic meanings, and that they in turn help pre-training achieve superior transfer performance on various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by 1.3 points with the same number of pre-training epochs. PeCo also improves object detection and segmentation on COCO val by +1.3 box AP and +1.0 mask AP, and semantic segmentation on ADE20k by +1.0 mIoU. Equipped with a larger ViT-H backbone, we achieve state-of-the-art performance (88.3% Top-1 accuracy) among methods using only ImageNet-1K data. The code and models will be available at
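To make the idea of "enforcing perceptual similarity during dVAE training" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes the tokenizer objective can be written as a pixel-level reconstruction term plus a weighted distance between deep features of the input and its reconstruction. The names `toy_features`, `tokenizer_loss`, and `perceptual_weight` are hypothetical, and the single random linear layer only stands in for whatever frozen pretrained network would supply the features in practice.

```python
import random


def toy_features(image, weights):
    """Hypothetical stand-in for a frozen feature network: one linear layer + ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, image))) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def tokenizer_loss(image, reconstruction, weights, perceptual_weight=1.0):
    """Pixel reconstruction term plus a feature-space (perceptual) term.

    Assumption: the two terms are simply added with a scalar weight.
    The abstract does not specify the exact form of the objective.
    """
    pixel_term = mse(image, reconstruction)
    perceptual_term = mse(
        toy_features(image, weights), toy_features(reconstruction, weights)
    )
    return pixel_term + perceptual_weight * perceptual_term


# Toy data: a flattened 16-pixel "image" and a slightly perturbed reconstruction.
rng = random.Random(0)
image = [rng.random() for _ in range(16)]
weights = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(8)]
noisy = [x + rng.uniform(-0.1, 0.1) for x in image]

print(tokenizer_loss(image, image, weights))   # perfect reconstruction
print(tokenizer_loss(image, noisy, weights))   # imperfect reconstruction
```

A perfect reconstruction gives a loss of zero, while the perturbed one is penalized both in pixel space and in feature space. Setting `perceptual_weight=0.0` recovers a plain pixel-level objective, which is the baseline the abstract contrasts against.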


Results from the Paper

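Task                 | Dataset  | Model             | Metric Name      | Metric Value | Global Rank
---------------------|----------|-------------------|------------------|--------------|------------
Image Classification | ImageNet | PeCo (ViT-H, 448) | Top 1 Accuracy   | 88.3%        | # 30
Image Classification | ImageNet | PeCo (ViT-H, 448) | Number of params | 656M         | # 16
Image Classification | ImageNet | PeCo (ViT-H, 224) | Top 1 Accuracy   | 87.5%        | # 40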