Compositional Chain-of-Thought Prompting for Large Multimodal Models

27 Nov 2023  ·  Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig

The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs), a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, SG annotations are expensive to collect and thus not easily scalable. Moreover, finetuning an LMM on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs. Code: https://github.com/chancharikmitra/CCoT
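The two-step procedure described in the abstract (generate a scene graph with the LMM, then condition the final answer on it) can be sketched in a few lines. This is a minimal illustration, not the authors' released code: `call_lmm` is a hypothetical stand-in for any LMM inference call (e.g. LLaVA-1.5 or InstructBLIP), and the prompt wording is a paraphrase of the prompting strategy, not the exact prompts from the paper.

```python
# Hedged sketch of zero-shot Compositional Chain-of-Thought (CCoT) prompting.
# Assumption: `call_lmm(image, prompt) -> str` wraps some LMM's inference API.

SG_PROMPT = (
    "For the provided image and question, generate a scene graph in JSON that "
    "includes the objects, object attributes, and object relationships "
    "relevant to answering the question.\n"
    "Question: {question}"
)

ANSWER_PROMPT = (
    "Scene Graph: {scene_graph}\n"
    "Using the image and the scene graph above as context, answer the "
    "following question: {question}"
)

def ccot_answer(image, question, call_lmm):
    """Two-step CCoT: (1) elicit a scene graph from the LMM, (2) prepend the
    generated scene graph to the prompt and ask for the final answer."""
    # Step 1: the LMM itself produces the scene graph (no annotated SGs needed).
    scene_graph = call_lmm(image, SG_PROMPT.format(question=question))
    # Step 2: the generated SG is inserted into the answer prompt as context.
    final_prompt = ANSWER_PROMPT.format(scene_graph=scene_graph,
                                        question=question)
    return call_lmm(image, final_prompt)
```

Because both steps are pure prompting, no fine-tuning is involved, which is how the method sidesteps catastrophic forgetting of the pretraining objective.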


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Reasoning | Winoground | LLaVA-1.5-CCoT | Text Score | 42.0 | #30 |
| Visual Reasoning | Winoground | LLaVA-1.5-CCoT | Image Score | 35.5 | #16 |
| Visual Reasoning | Winoground | LLaVA-1.5-CCoT | Group Score | 22.3 | #23 |
| Visual Reasoning | Winoground | LLaVA-1.5-ZS-CoT | Text Score | 28.0 | #78 |
| Visual Reasoning | Winoground | LLaVA-1.5-ZS-CoT | Image Score | 22.5 | #43 |
| Visual Reasoning | Winoground | LLaVA-1.5-ZS-CoT | Group Score | 12.3 | #55 |
| Visual Reasoning | Winoground | LLaVA-1.5 | Text Score | 36.0 | #48 |
| Visual Reasoning | Winoground | LLaVA-1.5 | Image Score | 33.3 | #17 |
| Visual Reasoning | Winoground | LLaVA-1.5 | Group Score | 20.1 | #30 |
| Visual Reasoning | Winoground | InstructBLIP-CCoT | Text Score | 21.0 | #96 |
| Visual Reasoning | Winoground | InstructBLIP-CCoT | Image Score | 21.3 | #47 |
| Visual Reasoning | Winoground | InstructBLIP-CCoT | Group Score | 8.3 | #77 |
| Visual Reasoning | Winoground | InstructBLIP-ZS-CoT | Text Score | 9.3 | #112 |
| Visual Reasoning | Winoground | InstructBLIP-ZS-CoT | Image Score | 16.3 | #61 |
| Visual Reasoning | Winoground | InstructBLIP-ZS-CoT | Group Score | 4.0 | #93 |
| Visual Reasoning | Winoground | InstructBLIP | Text Score | 7.0 | #113 |
| Visual Reasoning | Winoground | InstructBLIP | Image Score | 11.5 | #86 |
| Visual Reasoning | Winoground | InstructBLIP | Group Score | 3.3 | #100 |
