TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering	BenchLMM	Sphinx-V2-1K	GPT-3.5 score	57.43	# 2
Described Object Detection	Description Detection Dataset	SPHINX-7B	Intra-scenario FULL mAP	10.6	# 6
Described Object Detection	Description Detection Dataset	SPHINX-7B	Intra-scenario PRES mAP	11.4	# 6
Described Object Detection	Description Detection Dataset	SPHINX-7B	Intra-scenario ABS mAP	7.9	# 7
Visual Question Answering (VQA)	InfiMM-Eval	SPHINX v2	Overall score	39.48	# 2
Visual Question Answering (VQA)	InfiMM-Eval	SPHINX v2	Deductive	42.17	# 2
Visual Question Answering (VQA)	InfiMM-Eval	SPHINX v2	Abductive	49.85	# 2
Visual Question Answering (VQA)	InfiMM-Eval	SPHINX v2	Analogical	20.69	# 6
Visual Question Answering (VQA)	InfiMM-Eval	SPHINX v2	Params	16B	# 1
Visual Question Answering	MM-Vet	SPHINX-2k	GPT-4 score	40.2	# 36

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

13 Nov 2023 · Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
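The weight mix strategy described in the abstract can be illustrated with a minimal sketch. It assumes the mix is a linear interpolation between two checkpoints that share parameter names and shapes; the abstract only says the weights are "directly integrated", so the interpolation form, the coefficient `beta`, the function `mix_weights`, and the toy parameter names are all hypothetical and not taken from the released code. Plain Python lists stand in for real tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a real checkpoint: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def mix_weights(real: StateDict, synthetic: StateDict, beta: float = 0.5) -> StateDict:
    """Linearly interpolate two checkpoints: beta * real + (1 - beta) * synthetic.

    Assumption: both checkpoints come from the same architecture, so they
    have identical parameter names and shapes.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("checkpoints must share the same parameter names")

    mixed: StateDict = {}
    for name, w_real in real.items():
        w_syn = synthetic[name]
        if len(w_real) != len(w_syn):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two weight vectors.
        mixed[name] = [beta * r + (1.0 - beta) * s for r, s in zip(w_real, w_syn)]
    return mixed


# Toy checkpoints: one "trained on real-world data", one "trained on synthetic data".
real = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
synthetic = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

mixed = mix_weights(real, synthetic, beta=0.5)
print(mixed)
```

With `beta=1.0` the result is the real-data checkpoint unchanged, and with `beta=0.0` it is the synthetic-data checkpoint, so the coefficient controls how much each domain contributes.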

Code

alpha-vllm/llama2-accessory official

2,543

Tasks

Described Object Detection

Language Modelling

Large Language Model

Pose Estimation

Question Answering

Visual Question Answering

Visual Question Answering (VQA)

Datasets

MS COCO

Visual Question Answering

ImageNet-1K

GQA

RefCOCO

OK-VQA

TextVQA

ScienceQA

VizWiz

LAION-400M

MMBench

MM-Vet

LLaVA-Bench

VSR

InfiMM-Eval

BenchLMM

Description Detection Dataset

Results from the Paper

Ranked #2 on Visual Question Answering on BenchLMM

Methods

Visual Parsing
