Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts

1 Dec 2023 · Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, Radu Soricut

Large multimodal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture-of-Experts (MoE) architectures are useful for instruction tuning, but for LMMs with parameter counts around O(50-100B), the prohibitive cost of replicating and storing the expert models severely limits the number of experts that can be used. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low-rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive experiments demonstrate that the SMoLA approach improves generalist performance across a broad range of generative vision-and-language tasks, achieving new SoTA generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.
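The idea of a frozen backbone augmented with many low-rank experts, combined per token via soft routing weights, can be illustrated with a minimal sketch. This is not the authors' implementation: the module name SoftMoLoRALinear, the expert count, the rank, and the single-linear-layer router are all assumptions made for illustration.

```python
# Minimal sketch (assumptions, not the paper's code): a frozen base linear
# projection plus N low-rank (LoRA-like) experts whose residual outputs are
# mixed per token with soft routing weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMoLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, num_experts=4, rank=8):
        super().__init__()
        # Frozen backbone projection (the "foundational backbone").
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)
        # Low-rank expert factors: expert e computes x @ A[e] @ B[e].
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        # Router producing per-token soft weights over the experts.
        self.router = nn.Linear(d_in, num_experts, bias=False)

    def forward(self, x):                           # x: (batch, seq, d_in)
        gates = F.softmax(self.router(x), dim=-1)   # (batch, seq, E)
        # Each expert's low-rank update, computed for every token.
        expert_out = torch.einsum('bsd,edr,ero->bseo', x, self.A, self.B)
        # Softly mix the expert updates and add them as a residual.
        residual = torch.einsum('bse,bseo->bso', gates, expert_out)
        return self.base(x) + residual

# Example usage (shapes only):
layer = SoftMoLoRALinear(d_in=512, d_out=512, num_experts=4, rank=8)
x = torch.randn(2, 16, 512)   # (batch, seq, d_in)
y = layer(x)                  # (2, 16, 512)
```

A layer like this could, in principle, replace selected projection layers of the backbone; only the low-rank expert factors and the router are trained, so the added parameter count stays small relative to the base model.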


Results from the Paper


 Ranked #1 on Visual Question Answering (VQA) on A-OKVQA (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering (VQA) | AI2D | SMoLA-PaLI-X Specialist Model | EM | 82.5 | #1 |
| Visual Question Answering (VQA) | AI2D | SMoLA-PaLI-X Generalist Model | EM | 81.4 | #2 |
| Visual Question Answering (VQA) | A-OKVQA | SMoLA-PaLI-X Specialist Model | MC Accuracy | 83.75 | #1 |
| Visual Question Answering (VQA) | A-OKVQA | SMoLA-PaLI-X Specialist Model | DA VQA Score | 70.55 | #1 |
| Chart Question Answering | ChartQA | SMoLA-PaLI-X Generalist Model | 1:1 Accuracy | 73.8 | #8 |
| Chart Question Answering | ChartQA | SMoLA-PaLI-X Specialist Model | 1:1 Accuracy | 74.6 | #7 |
| Visual Question Answering (VQA) | DocVQA test | SMoLA-PaLI-X Generalist | ANLS | 0.906 | #3 |
| Visual Question Answering (VQA) | DocVQA test | SMoLA-PaLI-X Specialist | ANLS | 0.908 | #2 |
| Visual Question Answering (VQA) | InfographicVQA | SMoLA-PaLI-X Generalist | ANLS | 65.6 | #4 |
| Visual Question Answering (VQA) | InfographicVQA | SMoLA-PaLI-X Specialist | ANLS | 66.2 | #2 |
| Object Counting | TallyQA-Complex | SMoLA-PaLI-X Generalist (0-shot) | Accuracy | 70.7 | #3 |
| Object Counting | TallyQA-Complex | SMoLA-PaLI-X Specialist | Accuracy | 77.1 | #1 |
| Object Counting | TallyQA-Simple | SMoLA-PaLI-X Specialist | Accuracy | 86.3 | #1 |
| Object Counting | TallyQA-Simple | SMoLA-PaLI-X Generalist (0-shot) | Accuracy | 83.3 | #3 |
