MouSi: Poly-Visual-Expert Vision-Language Models

30 Jan 2024 · Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, Ming Zhang, Caishuang Huang, Rui Zheng, Zhiheng Xi, Yuhao Zhou, Shihan Dou, Junjie Ye, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang ·

Current large vision-language models (VLMs) often encounter challenges such as the insufficient capabilities of a single visual component and excessively long visual token sequences. These issues can limit the model's effectiveness in accurately interpreting complex visual information and over-lengthy contextual information. Addressing these challenges is crucial for enhancing the performance and applicability of VLMs. This paper proposes an ensemble-of-experts technique that synergizes the capabilities of individual visual encoders, including those skilled in image-text matching, OCR, and image segmentation. The technique introduces a fusion network to unify the processing of outputs from different visual experts, while bridging the gap between image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to alleviate the waste of positional encodings caused by lengthy image feature sequences, effectively addressing position overflow and length limitations. For instance, in our implementation, this technique reduces the positional occupancy of models like SAM from 4096 to 64, or even down to 1. Experimental results demonstrate that VLMs with multiple experts consistently outperform isolated visual encoders, with a significant performance boost as more experts are integrated. We have open-sourced the training code used in this report. All of these resources can be found on our project website.
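
The reduction in positional occupancy mentioned in the abstract (4096 positions down to 64 or 1) can be illustrated with a minimal sketch. It assumes that consecutive visual tokens are grouped and that every token in a group shares one position id. The function name `shared_position_ids` and this grouping rule are hypothetical illustrations, not the paper's implementation, which may differ.

```python
def shared_position_ids(num_tokens: int, num_positions: int, offset: int = 0) -> list[int]:
    """Assign position ids to `num_tokens` visual tokens using only
    `num_positions` distinct ids (hypothetical grouping rule: consecutive
    tokens are split into equal groups, and each group shares one id)."""
    if num_positions < 1 or num_tokens % num_positions != 0:
        raise ValueError("num_positions must be >= 1 and divide num_tokens evenly")
    group_size = num_tokens // num_positions
    # Token i falls into group i // group_size; `offset` shifts the ids so
    # they can follow any preceding text tokens.
    return [offset + i // group_size for i in range(num_tokens)]


# 4096 visual tokens (the SAM figure quoted in the abstract).
ids_64 = shared_position_ids(4096, 64)  # each id shared by 64 tokens
ids_1 = shared_position_ids(4096, 1)    # all tokens share a single id

print(len(set(ids_64)), len(set(ids_1)))  # 64 1
```

Under this assumption, the text tokens that follow the image start at position 64 (or 1) instead of 4096, which is the saving in position budget the abstract describes.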

Code

fudannlplab/mousi (official)

Tasks

Image Segmentation

Image-text matching

Optical Character Recognition (OCR)

Semantic Segmentation

Text Matching

Visual Question Answering

Datasets

Visual Question Answering

GQA

MM-Vet

Results from the Paper

Ranked #42 on Visual Question Answering on MM-Vet

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Visual Question Answering	MM-Vet	LayoutLMv3+ConvNeXt+CLIP	GPT-4 score	38.4	#42
Visual Question Answering	MM-Vet	LayoutLMv3+ConvNeXt+CLIP	Params	7.9B	#1

Methods

SAM
