Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
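The abstract does not say which coordinate representation or prompt wording the authors use. As an illustration only, the sketch below shows one way a bounding box could be rendered as text for a location-style instruction pair. The function names, the normalized `[x1, y1, x2, y2]` format, and the prompt templates are all assumptions made for this example, not the paper's method.

```python
def box_to_text(box, width, height, decimals=2):
    """Normalize a pixel-space (x1, y1, x2, y2) box to [0, 1] and render it as text.

    The normalized floating-point format is an assumption for illustration.
    """
    x1, y1, x2, y2 = box
    norm = (x1 / width, y1 / height, x2 / width, y2 / height)
    return "[" + ", ".join(f"{v:.{decimals}f}" for v in norm) + "]"


def location_prompt(category, box, width, height):
    """Build a hypothetical (question, answer) pair about where an object is."""
    question = f"Where is the {category} located in the image?"
    answer = f"The {category} is at {box_to_text(box, width, height)}."
    return question, answer


def side_of_image(box, width):
    """Return 'left' or 'right' depending on which half holds the box center."""
    center_x = (box[0] + box[2]) / 2
    return "left" if center_x < width / 2 else "right"


if __name__ == "__main__":
    # Hypothetical annotation: a dog in a 640x480 image.
    q, a = location_prompt("dog", (64, 48, 320, 240), 640, 480)
    print(q)
    print(a)
    print(side_of_image((64, 48, 320, 240), 640))
```

Pairs like these could serve as fine-tuning targets, and the `side_of_image` helper shows how a left vs. right label can be derived from the same box annotation.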

CVPR 2024
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA | LocVLM-Vid-B+ | Accuracy | 38.2 | # 27 |
| Video Question Answering | ActivityNet-QA | LocVLM-Vid-B | Accuracy | 37.4 | # 28 |
| Visual Question Answering | GQA | LocVLM-L | Accuracy | 50.2 | # 1 |
| Video Question Answering | MSR-VTT | LocVLM-Vid-B | Accuracy | 51.2 | # 1 |
| Video Question Answering | MSVD-QA | LocVLM-Vid-B | Accuracy | 66.1 | # 1 |
| Zero-Shot Region Description | RefCOCOg-test | LocVLM-B | METEOR | 26.2 | # 1 |
| Zero-Shot Region Description | RefCOCOg-val | LocVLM-B | METEOR | 26 | # 1 |
| Zero-Shot Region Description | RefCOCO testB | LocVLM-B | METEOR | 14.6 | # 1 |
| Zero-Shot Region Description | RefCOCO+ testB | LocVLM-B | METEOR | 15.2 | # 1 |
| Video Question Answering | TGIF-QA | LocVLM-Vid-B | Accuracy | 51.8 | # 1 |
| Visual Question Answering | VQA v2 test-dev | LocVLM-L | Accuracy | 56.2 | # 11 |
| Visual Question Answering | VQA v2 val | LocVLM-L | Accuracy | 55.9 | # 4 |
