Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts

1 Dec 2023 · Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, Radu Soricut

Large multimodal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture-of-Experts (MoE) architectures are useful for instruction tuning, but for LMMs with parameter counts around O(50-100B), the prohibitive cost of replicating and storing the expert models severely limits the number of experts that can be used. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low-rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive experiments demonstrate that the SMoLA approach improves generalist performance across a broad range of generative vision-and-language tasks, achieving new SoTA generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.
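The idea of a frozen backbone augmented with many low-rank experts, combined per token via soft routing weights, can be illustrated with a minimal sketch. This is not the authors' implementation: the module name SoftMoLoRALinear, the expert count, the rank, and the single-linear-layer router are all assumptions made for illustration.

```python
# Minimal sketch (assumptions, not the paper's code): a frozen base linear
# projection plus N low-rank (LoRA-like) experts whose residual outputs are
# mixed per token with soft routing weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMoLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, num_experts=4, rank=8):
        super().__init__()
        # Frozen backbone projection (the "foundational backbone").
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)
        # Low-rank expert factors: expert e computes x @ A[e] @ B[e].
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        # Router producing per-token soft weights over the experts.
        self.router = nn.Linear(d_in, num_experts, bias=False)

    def forward(self, x):                           # x: (batch, seq, d_in)
        gates = F.softmax(self.router(x), dim=-1)   # (batch, seq, E)
        # Each expert's low-rank update, computed for every token.
        expert_out = torch.einsum('bsd,edr,ero->bseo', x, self.A, self.B)
        # Softly mix the expert updates and add them as a residual.
        residual = torch.einsum('bse,bseo->bso', gates, expert_out)
        return self.base(x) + residual

# Example usage (shapes only):
layer = SoftMoLoRALinear(d_in=512, d_out=512, num_experts=4, rank=8)
x = torch.randn(2, 16, 512)   # (batch, seq, d_in)
y = layer(x)                  # (2, 16, 512)
```

A layer like this could, in principle, replace selected projection layers of the backbone; only the low-rank expert factors and the router are trained, so the added parameter count stays small relative to the base model.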


Results from the Paper


 Ranked #1 on Visual Question Answering (VQA) on A-OKVQA (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering (VQA) | AI2D | SMoLA-PaLI-X Specialist Model | EM | 82.5 | #1 |
| Visual Question Answering (VQA) | AI2D | SMoLA-PaLI-X Generalist Model | EM | 81.4 | #2 |
| Visual Question Answering (VQA) | A-OKVQA | SMoLA-PaLI-X Specialist Model | MC Accuracy | 83.75 | #1 |
| Visual Question Answering (VQA) | A-OKVQA | SMoLA-PaLI-X Specialist Model | DA VQA Score | 70.55 | #1 |
| Chart Question Answering | ChartQA | SMoLA-PaLI-X Generalist Model | 1:1 Accuracy | 73.8 | #8 |
| Chart Question Answering | ChartQA | SMoLA-PaLI-X Specialist Model | 1:1 Accuracy | 74.6 | #7 |
| Visual Question Answering (VQA) | DocVQA test | SMoLA-PaLI-X Generalist | ANLS | 0.906 | #3 |
| Visual Question Answering (VQA) | DocVQA test | SMoLA-PaLI-X Specialist | ANLS | 0.908 | #2 |
| Visual Question Answering (VQA) | InfographicVQA | SMoLA-PaLI-X Generalist | ANLS | 65.6 | #4 |
| Visual Question Answering (VQA) | InfographicVQA | SMoLA-PaLI-X Specialist | ANLS | 66.2 | #2 |
| Object Counting | TallyQA-Complex | SMoLA-PaLI-X Generalist (0-shot) | Accuracy | 70.7 | #3 |
| Object Counting | TallyQA-Complex | SMoLA-PaLI-X Specialist | Accuracy | 77.1 | #1 |
| Object Counting | TallyQA-Simple | SMoLA-PaLI-X Specialist | Accuracy | 86.3 | #1 |
| Object Counting | TallyQA-Simple | SMoLA-PaLI-X Generalist (0-shot) | Accuracy | 83.3 | #3 |
