This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.
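For readers unfamiliar with the contrastive objective mentioned above, below is a minimal sketch of the pairwise sigmoid loss that SigLIP (Zhai et al., 2023) optimizes in place of the usual softmax contrastive loss. The function name and the illustrative temperature and bias values are ours, not from the PaLI-3 paper.

```python
import numpy as np

def siglip_pairwise_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Sketch of the SigLIP pairwise sigmoid loss.

    img_emb, txt_emb: (n, d) L2-normalized image/text embeddings for n
    matched pairs. `temperature` and `bias` are learnable scalars in the
    real model; the values here are illustrative only.
    """
    n = img_emb.shape[0]
    # Pairwise similarity logits for every image-text combination.
    logits = temperature * img_emb @ txt_emb.T + bias   # (n, n)
    # +1 for the n matching pairs (diagonal), -1 for all mismatched pairs.
    labels = 2.0 * np.eye(n) - 1.0
    # -log sigmoid(z) = log(1 + exp(-z)); logaddexp keeps this numerically stable.
    # Sum over all pairs, averaged over the batch, as in the SigLIP paper.
    return np.logaddexp(0.0, -labels * logits).sum() / n
```

Unlike the softmax contrastive loss used in CLIP, each image-text pair is scored independently, so the loss needs no global normalization over the batch.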


Results from the Paper


Ranked #2 on Temporal/Causal QA on NExT-QA (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Chart Question Answering | ChartQA | PaLI-3 (w/ OCR) | 1:1 Accuracy | 69.5 | #14 |
| Chart Question Answering | ChartQA | PaLI-3 | 1:1 Accuracy | 70 | #13 |
| Visual Question Answering (VQA) | DocVQA test | PaLI-3 | ANLS | 0.876 | #11 |
| Visual Question Answering (VQA) | DocVQA test | PaLI-3 (w/ OCR) | ANLS | 0.886 | #6 |
| Visual Question Answering (VQA) | InfographicVQA | PaLI-3 (w/ OCR) | ANLS | 62.4 | #6 |
| Visual Question Answering (VQA) | InfographicVQA | PaLI-3 | ANLS | 57.8 | #8 |
| Temporal/Causal QA | NExT-QA | PaLI-3 | WUPS | 37.7 | #2 |
