iBOT: Image BERT Pre-Training with Online Tokenizer

15 Nov 2021  Β·  Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong Β·

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework, iBOT, that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline in which the tokenizer must be pre-trained beforehand. We show the prominence of iBOT by achieving 82.3% linear-probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which help the models obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
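The core of the method described above is a pair of losses: a MIM loss that distills the EMA teacher's output on full patch tokens into the student's output on masked patch tokens (the teacher acting as the "online tokenizer"), and a [CLS]-token self-distillation loss for global semantics. The following is a minimal sketch of that wiring, assuming toy stand-in networks; the paper uses ViT backbones, a learnable mask token, multi-crop augmented views, and target centering, all omitted here for brevity.

```python
# Hedged sketch of the iBOT objective: MIM self-distillation on masked patch
# tokens plus [CLS]-token self-distillation, with an EMA-updated teacher.
# The networks below are simplified stand-ins, not the paper's ViT backbones.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_IN, D_OUT = 16, 32  # toy token dimension / projection-head output dimension

def make_net():
    # Stand-in for a ViT backbone + projection head (assumption: a per-token
    # MLP suffices to illustrate how the losses are wired together).
    return nn.Sequential(nn.Linear(D_IN, 64), nn.GELU(), nn.Linear(64, D_OUT))

student, teacher = make_net(), make_net()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)  # teacher is updated only via EMA, never by SGD

def H(t_logits, s_logits, t_temp=0.04, s_temp=0.1):
    # Cross-entropy between sharpened teacher targets and student predictions.
    t = F.softmax(t_logits / t_temp, dim=-1)
    return -(t * F.log_softmax(s_logits / s_temp, dim=-1)).sum(-1)

def ibot_loss(tokens, cls_tok, mask):
    # tokens: (B, N, D_IN) patch tokens; cls_tok: (B, D_IN); mask: (B, N) bool.
    # Masked patches are zeroed here; the paper uses a learnable mask token.
    masked = torch.where(mask[..., None], torch.zeros_like(tokens), tokens)
    s_patch = student(masked)               # student sees the corrupted view
    s_cls = student(cls_tok)
    with torch.no_grad():
        t_patch = teacher(tokens)           # online tokenizer: teacher on
        t_cls = teacher(cls_tok)            # the uncorrupted tokens
    l_mim = (H(t_patch, s_patch) * mask).sum() / mask.sum().clamp(min=1)
    l_cls = H(t_cls, s_cls).mean()          # paper: across two augmented views
    return l_mim + l_cls

@torch.no_grad()
def ema_update(m=0.996):
    # Teacher parameters track an exponential moving average of the student.
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

# One toy training step on random data.
tokens = torch.randn(2, 8, D_IN)
cls_tok = torch.randn(2, D_IN)
mask = torch.zeros(2, 8, dtype=torch.bool)
mask[:, :3] = True                          # mask the first 3 patches
loss = ibot_loss(tokens, cls_tok, mask)
loss.backward()
ema_update()
```

Because the tokenizer (the teacher) is just the EMA of the student, it trains jointly with the MIM objective, which is what removes the separate tokenizer pre-training stage mentioned in the abstract.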

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | iBOT (ViT-B/16) (linear head) | Validation mIoU | 38.3 | #214 |
| Semantic Segmentation | ADE20K | iBOT (ViT-S/16) | Validation mIoU | 45.4 | #183 |
| Semantic Segmentation | ADE20K | iBOT (ViT-B/16) | Validation mIoU | 50.0 | #114 |
| Instance Segmentation | COCO test-dev | iBOT (ViT-B/16) | mask AP | 44.2 | #43 |
| Instance Segmentation | COCO test-dev | iBOT (ViT-S/16) | mask AP | 42.6 | #51 |
| Object Detection | COCO test-dev | iBOT (ViT-S/16) | box mAP | 49.4 | #90 |
| Object Detection | COCO test-dev | iBOT (ViT-B/16) | box mAP | 51.2 | #78 |
| Unsupervised Image Classification | ImageNet | iBOT (ViT-S/16) | Accuracy (%) | 43.4 | #2 |
| Unsupervised Image Classification | ImageNet | iBOT (ViT-S/16) | ARI | 32.8 | #1 |
| Self-Supervised Image Classification | ImageNet | iBOT (ViT-L/16) (IN22k) | Top 1 Accuracy | 82.3% | #11 |
| Self-Supervised Image Classification | ImageNet | iBOT (ViT-L/16) (IN22k) | Number of Params | 307M | #16 |
| Self-Supervised Image Classification | ImageNet | iBOT (ViT-L/16) | Top 1 Accuracy | 81.3% | #16 |
| Self-Supervised Image Classification | ImageNet | iBOT (ViT-L/16) | Number of Params | 307M | #16 |
| Semi-Supervised Image Classification | ImageNet (1% labeled data) | iBOT (ViT-S/16) | Top 1 Accuracy | 61.9% | #30 |
| Self-Supervised Image Classification | ImageNet (finetuned) | iBOT (ViT-L/16, 512) | Top 1 Accuracy | 87.8% | #7 |
| Self-Supervised Image Classification | ImageNet (finetuned) | iBOT (ViT-L/16, 512) | Number of Params | 307M | #13 |
| Self-Supervised Image Classification | ImageNet (finetuned) | iBOT (ViT-L/16) | Top 1 Accuracy | 86.6% | #12 |
| Self-Supervised Image Classification | ImageNet (finetuned) | iBOT (ViT-L/16) | Number of Params | 307M | #13 |
| Self-Supervised Image Classification | ImageNet (finetuned) | iBOT (ViT-L/16) | Top 1 Accuracy | 84.8% | #26 |
| Self-Supervised Image Classification | ImageNet (finetuned) | iBOT (ViT-B/16) | Top 1 Accuracy | 84.4% | #31 |
| Self-Supervised Image Classification | ImageNet (finetuned) | iBOT (ViT-B/16) | Number of Params | 85M | #39 |
| Self-Supervised Image Classification | ImageNet (finetuned) | iBOT (ViT-B/16) | Top 1 Accuracy | 84.0% | #38 |
