Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and to build on the substantial cost already invested in training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.
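
The single "image and text in, text out" interface described in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen only for illustration: the `Example` container, the prompt wordings, the `echo_model` stand-in, and the exact-match score are assumptions, not the prompts, model, or metrics used by PaLI.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# An image is represented here only as an opaque sequence of pixel values;
# a real system would pass a decoded image tensor instead.
Image = Sequence[float]


@dataclass(frozen=True)
class Example:
    """One (image, input text) -> output text example."""
    image: Image
    prompt: str
    target: str


def captioning_example(image: Image, caption: str, lang: str) -> Example:
    # Hypothetical prompt wording; the language code selects the output language.
    return Example(image, f"caption in {lang}", caption)


def vqa_example(image: Image, question: str, answer: str, lang: str) -> Example:
    # Hypothetical prompt wording; the question is carried in the text input.
    return Example(image, f"answer in {lang}: {question}", answer)


# Any model exposing this signature can be applied to every task above.
ImageTextToText = Callable[[Image, str], str]


def exact_match(model: ImageTextToText, examples: Sequence[Example]) -> float:
    """Fraction of examples whose generated text equals the target exactly."""
    if not examples:
        return 0.0
    hits = sum(model(ex.image, ex.prompt) == ex.target for ex in examples)
    return hits / len(examples)


def echo_model(image: Image, prompt: str) -> str:
    # Stand-in for a real model: always answers "two", ignoring its inputs.
    return "two"
```

Because captioning and question answering are both expressed as text generation conditioned on an image and a text input, the same `exact_match` loop can score either task. Benchmarks such as those listed on this page use task-specific metrics (for example CIDEr or VQA accuracy) rather than this toy exact-match score.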

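| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Transfer Image Classification | ImageNet | LiT ViT-e | Accuracy (Private) | 85.4 | # 6 |
| Zero-Shot Transfer Image Classification | ImageNet | PaLI | Accuracy (Private) | 72.11 | # 20 |
| Image Classification | ImageNet | ViT-e | Top 1 Accuracy | 90.9% | # 6 |
| Image Classification | ImageNet | ViT-e | Number of params | 3900M | # 977 |
| Zero-Shot Transfer Image Classification | ImageNet-A | PaLI | Accuracy (Private) | 44.7 | # 13 |
| Zero-Shot Transfer Image Classification | ImageNet-A | LiT ViT-e | Accuracy (Private) | 88.0 | # 3 |
| Zero-Shot Transfer Image Classification | ImageNet-R | PaLI | Accuracy | 81.97 | # 12 |
| Zero-Shot Transfer Image Classification | ImageNet-R | LiT ViT-e | Accuracy | 96.1 | # 3 |
| Zero-Shot Transfer Image Classification | ImageNet-S | PaLI | Top 5 Accuracy | 79.3 | # 1 |
| Zero-Shot Transfer Image Classification | ImageNet-S | PaLI | Accuracy (Private) | 63.83 | # 1 |
| Zero-Shot Transfer Image Classification | ImageNet V2 | LiT ViT-e | Accuracy (Private) | 80.6 | # 4 |
| Image Classification | ImageNet V2 | ViT-e | Top 1 Accuracy | 84.3 | # 2 |
| Zero-Shot Transfer Image Classification | ImageNet V2 | PaLI | Accuracy (Private) | 64.46 | # 13 |
| Image Captioning | nocaps in-domain | PaLI | CIDEr | 149.1 | # 1 |
| Image Captioning | nocaps in-domain | PaLI | CIDEr | 121.09 | # 4 |
| Image Captioning | nocaps in-domain | PaLI | B1 | 88.02 | # 3 |
| Image Captioning | nocaps in-domain | PaLI | B2 | 75.21 | # 3 |
| Image Captioning | nocaps in-domain | PaLI | B3 | 59.38 | # 3 |
| Image Captioning | nocaps in-domain | PaLI | B4 | 41.16 | # 2 |
| Image Captioning | nocaps in-domain | PaLI | ROUGE-L | 64.39 | # 1 |
| Image Captioning | nocaps in-domain | PaLI | METEOR | 34.22 | # 1 |
| Image Captioning | nocaps in-domain | PaLI | SPICE | 15.69 | # 3 |
| Image Captioning | nocaps near-domain | PaLI | SPICE | 15.75 | # 3 |
| Image Captioning | nocaps near-domain | PaLI | CIDEr | 124.35 | # 2 |
| Image Captioning | nocaps near-domain | PaLI | B1 | 88.57 | # 2 |
| Image Captioning | nocaps near-domain | PaLI | B2 | 75.56 | # 2 |
| Image Captioning | nocaps near-domain | PaLI | B3 | 58.99 | # 1 |
| Image Captioning | nocaps near-domain | PaLI | B4 | 39.98 | # 1 |
| Image Captioning | nocaps near-domain | PaLI | ROUGE-L | 63.99 | # 1 |
| Image Captioning | nocaps near-domain | PaLI | METEOR | 33.47 | # 1 |
| Image Captioning | nocaps out-of-domain | PaLI | CIDEr | 126.67 | # 1 |
| Image Captioning | nocaps out-of-domain | PaLI | B1 | 86.28 | # 1 |
| Image Captioning | nocaps out-of-domain | PaLI | B2 | 71.19 | # 2 |
| Image Captioning | nocaps out-of-domain | PaLI | B3 | 52.63 | # 2 |
| Image Captioning | nocaps out-of-domain | PaLI | B4 | 32.0 | # 1 |
| Image Captioning | nocaps out-of-domain | PaLI | ROUGE-L | 61.35 | # 1 |
| Image Captioning | nocaps out-of-domain | PaLI | METEOR | 30.99 | # 1 |
| Image Captioning | nocaps out-of-domain | PaLI | SPICE | 15.49 | # 3 |
| Image Classification | ObjectNet | ViT-e | Top-1 Accuracy | 72.0 | # 13 |
| Zero-Shot Transfer Image Classification | ObjectNet | LiT ViT-e | Accuracy (Private) | 84.9 | # 2 |
| Zero-Shot Transfer Image Classification | ObjectNet | PaLI | Accuracy (Private) | 42.62 | # 9 |
| Zero-Shot Transfer Image Classification | ObjectNet | PaLI | Top 5 Accuracy | 58.35 | # 1 |
| Visual Question Answering (VQA) | OK-VQA | PaLI 17B | Accuracy | 64.5 | # 4 |
| Visual Question Answering (VQA) | TextVQA test-standard | PaLI | overall | 73.1 | # 1 |
| Visual Question Answering (VQA) | VizWiz 2020 VQA | PaLI | overall | 73.3 | # 1 |
| Visual Question Answering (VQA) | VQA v2 test-dev | PaLI | Accuracy | 84.3 | # 1 |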
