Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
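The backbone described above can be pictured concretely. Below is a minimal PyTorch sketch of a Multiway Transformer block, assuming the design described in the abstract: a self-attention module shared across modalities, routed to a separate feed-forward "expert" per modality (vision, language, and a fused vision-language expert in the top layers). Dimensions, names, and the routing interface are illustrative, not the official BEiT-3 (torchscale) implementation.

```python
import torch
import torch.nn as nn

class MultiwayTransformerBlock(nn.Module):
    """Sketch of a Multiway Transformer block: shared self-attention,
    modality-specific feed-forward experts. Hyperparameters are
    illustrative defaults, not the released BEiT-3 configuration."""

    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # One attention module shared by all modalities: this is what
        # aligns vision and language in a common space.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN expert per modality; the "vl" expert stands in for the
        # fused vision-language expert used in the top layers.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )
            for name in ("vision", "language", "vl")
        })

    def forward(self, x, modality="vision"):
        # Shared attention over the token sequence...
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # ...then the per-modality expert captures modality-specific features.
        x = x + self.experts[modality](self.norm2(x))
        return x

tokens = torch.randn(2, 197, 768)  # e.g. a batch of image patch embeddings
out = MultiwayTransformerBlock()(tokens, modality="vision")
```

The modular split is what lets one checkpoint serve as a vision encoder, a language encoder, or a fusion encoder at transfer time: the routing decides which expert processes each token, while the shared attention enables deep fusion for multimodal inputs.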
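The unified pretraining task can likewise be sketched in a few lines. Treating images as a foreign language means an image is first tokenized into discrete visual tokens, after which the same mask-and-recover objective applies to text, images, and image-text pairs. The sketch below assumes that setup; `model`, `mask_id`, and the uniform masking are placeholders (the paper uses block-wise masking for image patches, and masking ratios differ per modality).

```python
import torch
import torch.nn.functional as F

def masked_data_modeling_loss(model, tokens, mask_ratio=0.4, mask_id=0):
    """Sketch of the unified masked-'language'-modeling objective: the same
    recover-the-masked-token loss applies whether `tokens` are text tokens,
    discrete visual tokens ('Imglish'), or a concatenated image-text pair.
    `model` maps token ids of shape (batch, seq_len) to per-position logits
    over the vocabulary; all names here are illustrative."""
    batch, seq_len = tokens.shape
    # Randomly pick positions to corrupt (uniform here for brevity).
    mask = torch.rand(batch, seq_len) < mask_ratio
    corrupted = tokens.masked_fill(mask, mask_id)
    logits = model(corrupted)  # (batch, seq_len, vocab_size)
    # Cross-entropy is computed only on the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```

Because the loss never branches on modality, scaling up amounts to feeding more images, more text, and more image-text pairs through the one objective.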
Results

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | BEiT-3 | Validation mIoU | 62.8 | #4 |
| Semantic Segmentation | ADE20K | BEiT-3 | Params (M) | 1900 | #1 |
| Semantic Segmentation | ADE20K val | BEiT-3 | mIoU | 62.8 | #1 |
| Cross-Modal Retrieval | COCO 2014 | BEiT-3 | Image-to-text R@1 | 84.8 | #1 |
| Cross-Modal Retrieval | COCO 2014 | BEiT-3 | Image-to-text R@5 | 96.5 | #2 |
| Cross-Modal Retrieval | COCO 2014 | BEiT-3 | Image-to-text R@10 | 98.3 | #5 |
| Cross-Modal Retrieval | COCO 2014 | BEiT-3 | Text-to-image R@1 | 67.2 | #4 |
| Cross-Modal Retrieval | COCO 2014 | BEiT-3 | Text-to-image R@5 | 87.7 | #16 |
| Cross-Modal Retrieval | COCO 2014 | BEiT-3 | Text-to-image R@10 | 92.8 | #1 |
| Instance Segmentation | COCO test-dev | BEiT-3 | mask AP | 54.8 | #4 |
| Object Detection | COCO test-dev | BEiT-3 | box mAP | 63.7 | #11 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | BEiT-3 | Image-to-text R@1 | 94.9 | #1 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | BEiT-3 | Image-to-text R@5 | 99.9 | #1 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | BEiT-3 | Image-to-text R@10 | 100.0 | #1 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | BEiT-3 | Text-to-image R@1 | 81.5 | #1 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | BEiT-3 | Text-to-image R@5 | 95.6 | #2 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | BEiT-3 | Text-to-image R@10 | 97.8 | #2 |
| Cross-Modal Retrieval | Flickr30k | BEiT-3 | Image-to-text R@1 | 98.0 | #3 |
| Cross-Modal Retrieval | Flickr30k | BEiT-3 | Image-to-text R@5 | 100.0 | #1 |
| Cross-Modal Retrieval | Flickr30k | BEiT-3 | Image-to-text R@10 | 100.0 | #1 |
| Cross-Modal Retrieval | Flickr30k | BEiT-3 | Text-to-image R@1 | 90.3 | #4 |
| Cross-Modal Retrieval | Flickr30k | BEiT-3 | Text-to-image R@5 | 98.7 | #2 |
| Cross-Modal Retrieval | Flickr30k | BEiT-3 | Text-to-image R@10 | 99.5 | #2 |
| Visual Reasoning | NLVR2 Dev | BEiT-3 | Accuracy | 91.51 | #1 |
| Visual Reasoning | NLVR2 Test | BEiT-3 | Accuracy | 92.58 | #1 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BEiT-3 | Accuracy | 84.19 | #2 |
| Visual Question Answering (VQA) | VQA v2 test-std | BEiT-3 | overall | 84.03 | #1 |