TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK	REMOVE
Visual Question Answering	MM-Vet	CoLLaVO-7B	GPT-4 score	40.3	# 34

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/collavo-crayon-large-language-and-vision/visual-question-answering-on-mm-vet)](https://paperswithcode.com/sota/visual-question-answering-on-mm-vet?p=collavo-crayon-large-language-and-vision)`

CoLLaVO: Crayon Large Language and Vision mOdel

17 Feb 2024 · Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro ·

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from `what objects are in the image?' or `which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.

PDF Abstract