TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Reasoning	Winoground	GPT-4V + CoCoT	Text Score	58.5	# 4
Visual Reasoning	Winoground	GPT-4V + CoCoT	Image Score	49.5	# 4
Visual Reasoning	Winoground	GPT-4V + CoCoT	Group Score	44.5	# 4
Visual Reasoning	Winoground	GPT-4V	Text Score	54.5	# 6
Visual Reasoning	Winoground	GPT-4V	Image Score	42.5	# 10
Visual Reasoning	Winoground	GPT-4V	Group Score	37.75	# 10
Visual Reasoning	Winoground	Gemini + CoCoT	Text Score	40	# 36
Visual Reasoning	Winoground	Gemini + CoCoT	Image Score	32.5	# 19
Visual Reasoning	Winoground	Gemini + CoCoT	Group Score	27.75	# 16
Visual Reasoning	Winoground	Gemini + CCoT	Text Score	22.5	# 95
Visual Reasoning	Winoground	Gemini + CCoT	Image Score	33	# 18
Visual Reasoning	Winoground	Gemini + CCoT	Group Score	20.75	# 28
Visual Reasoning	Winoground	Gemini + DDCoT	Text Score	45	# 17
Visual Reasoning	Winoground	Gemini + DDCoT	Image Score	25	# 34
Visual Reasoning	Winoground	Gemini + DDCoT	Group Score	23.75	# 19
Visual Reasoning	Winoground	Gemini	Text Score	30.75	# 62
Visual Reasoning	Winoground	Gemini	Image Score	26	# 29
Visual Reasoning	Winoground	Gemini	Group Score	25	# 18
Visual Reasoning	Winoground	MMICL + CoCoT	Text Score	64.25	# 3
Visual Reasoning	Winoground	MMICL + CoCoT	Image Score	52.5	# 3
Visual Reasoning	Winoground	MMICL + CoCoT	Group Score	50.75	# 2
Visual Reasoning	Winoground	MMICL + CCoT	Text Score	51	# 9
Visual Reasoning	Winoground	MMICL + CCoT	Image Score	48	# 5
Visual Reasoning	Winoground	MMICL + CCoT	Group Score	47.5	# 3
Visual Reasoning	Winoground	MMICL + DDCoT	Text Score	46.75	# 12
Visual Reasoning	Winoground	MMICL + DDCoT	Image Score	45	# 8
Visual Reasoning	Winoground	MMICL + DDCoT	Group Score	36.75	# 11
Visual Reasoning	Winoground	OpenFlamingo + CoCoT	Text Score	58.25	# 5
Visual Reasoning	Winoground	OpenFlamingo + CoCoT	Image Score	55.25	# 2
Visual Reasoning	Winoground	OpenFlamingo + CoCoT	Group Score	41.5	# 6
Visual Reasoning	Winoground	OpenFlamingo + CCoT	Text Score	42.5	# 27
Visual Reasoning	Winoground	OpenFlamingo + CCoT	Image Score	27.5	# 24
Visual Reasoning	Winoground	OpenFlamingo + CCoT	Group Score	20	# 31
Visual Reasoning	Winoground	OpenFlamingo + DDCoT	Text Score	47.5	# 10
Visual Reasoning	Winoground	OpenFlamingo + DDCoT	Image Score	47.25	# 6
Visual Reasoning	Winoground	OpenFlamingo + DDCoT	Group Score	39	# 8
Visual Reasoning	Winoground	OpenFlamingo	Text Score	39	# 39
Visual Reasoning	Winoground	OpenFlamingo	Image Score	41.25	# 13
Visual Reasoning	Winoground	OpenFlamingo	Group Score	33.25	# 12
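
As a reading aid for the table above, the following minimal sketch computes how much each model's scores change when CoCoT prompting is added to the plain model. The numbers are copied from the table; the dictionary layout, the `"baseline"` label, and the function name `cocot_gains` are illustrative assumptions, not part of the paper. MMICL is skipped because the table lists no plain MMICL row.

```python
# Scores copied from the table above: (model, prompting) -> (text, image, group).
# "baseline" is a label chosen here for the rows without a prompting suffix.
SCORES = {
    ("GPT-4V", "baseline"): (54.5, 42.5, 37.75),
    ("GPT-4V", "CoCoT"): (58.5, 49.5, 44.5),
    ("Gemini", "baseline"): (30.75, 26.0, 25.0),
    ("Gemini", "CoCoT"): (40.0, 32.5, 27.75),
    ("MMICL", "CoCoT"): (64.25, 52.5, 50.75),  # no plain MMICL row in the table
    ("OpenFlamingo", "baseline"): (39.0, 41.25, 33.25),
    ("OpenFlamingo", "CoCoT"): (58.25, 55.25, 41.5),
}
METRICS = ("Text Score", "Image Score", "Group Score")


def cocot_gains(scores):
    """Return {model: {metric: CoCoT score minus baseline score}}."""
    gains = {}
    for (model, prompting), values in scores.items():
        if prompting != "CoCoT":
            continue
        base = scores.get((model, "baseline"))
        if base is None:
            continue  # cannot compute a gain without a baseline row
        gains[model] = {
            metric: round(value - ref, 2)
            for metric, value, ref in zip(METRICS, values, base)
        }
    return gains


if __name__ == "__main__":
    for model, per_metric in cocot_gains(SCORES).items():
        print(model, per_metric)
```

On these figures, CoCoT raises all three scores for GPT-4V, Gemini, and OpenFlamingo, with the largest gains on OpenFlamingo.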

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

5 Jan 2024 · Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan YAO, Mingkai Chen, Jiebo Luo

In the development of Artificial General Intelligence (AGI), a critical task for models is interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason about and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach based on multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about multi-image inputs based on the identified similarities and differences. Our experimental results showcase CoCoT's proficiency in enhancing the multi-image comprehension capabilities of large multimodal models.
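
The two-stage idea in the abstract (first contrast the images, then answer using that contrast) can be sketched as a prompt template. This is only an illustration under stated assumptions: the wording of the prompt, the `<image i>` placeholders, and the function name `build_cocot_prompt` are hypothetical and are not the authors' actual prompt or any model's API.

```python
def build_cocot_prompt(question: str, num_images: int = 2) -> str:
    """Build a contrastive chain-of-thought style prompt (hypothetical wording).

    The prompt first asks for similarities and differences among the images,
    then asks the question conditioned on that comparison.
    """
    if num_images < 2:
        raise ValueError("contrastive prompting needs at least two images")
    # Placeholder tokens standing in for however a given model references images.
    image_refs = ", ".join(f"<image {i}>" for i in range(1, num_images + 1))
    return (
        f"Images: {image_refs}\n"
        "Step 1: Describe the similarities and differences among these images.\n"
        "Step 2: Based on the similarities and differences you identified, "
        f"answer the following question: {question}"
    )


if __name__ == "__main__":
    print(build_cocot_prompt("Which image matches the caption?", num_images=2))
```

How the images themselves are attached depends on the model being queried, which is why the sketch stops at building the text.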

Tasks

Image Comprehension

Text Matching

Visual Reasoning

Datasets

RAVEN

Winoground

Results from the Paper

Ranked #3 on Visual Reasoning on Winoground
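
For readers unfamiliar with the three Winoground metrics, the sketch below follows the definitions given by the benchmark's authors: each example has two captions and two images, the text score requires the right caption to win for each image, the image score requires the right image to win for each caption, and the group score requires both. The `score` callable is a hypothetical stand-in for any caption-image matching function; this page does not state how matches were elicited from the generative models evaluated here, so treat this as the generic metric only.

```python
from typing import Callable, Dict, Iterable, Tuple

# One example: (caption_0, image_0, caption_1, image_1); images are opaque IDs here.
Example = Tuple[str, str, str, str]


def winoground_scores(
    examples: Iterable[Example],
    score: Callable[[str, str], float],  # hypothetical matcher: score(caption, image)
) -> Dict[str, float]:
    """Return Text, Image, and Group scores as percentages."""
    text = image = group = total = 0
    for c0, i0, c1, i1 in examples:
        total += 1
        # Text score: for each image, the matching caption must score higher.
        text_ok = score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)
        # Image score: for each caption, the matching image must score higher.
        image_ok = score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0)
        text += text_ok
        image += image_ok
        group += text_ok and image_ok  # group requires both conditions
    if total == 0:
        raise ValueError("no examples given")
    return {
        "Text Score": 100.0 * text / total,
        "Image Score": 100.0 * image / total,
        "Group Score": 100.0 * group / total,
    }
```

Because the group score requires both conditions on the same example, it can never exceed the smaller of the text and image scores, which matches the pattern in the reported results.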
