CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

5 Jan 2024  ·  Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan YAO, Mingkai Chen, Jiebo Luo

In the pursuit of Artificial General Intelligence (AGI), a critical task for models is interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason about and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach based on multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about the multi-image inputs based on the identified similarities and differences. Our experimental results show that CoCoT enhances the multi-image comprehension capabilities of large multimodal models.
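The two-stage idea described in the abstract (first contrast the images, then answer using that contrast) can be sketched as a small prompt builder. This is a minimal illustration only: the function name `build_cocot_prompt` and the exact prompt wording are assumptions made for this example, not the authors' actual prompt, and how the images themselves are attached depends on the model API being used.

```python
def build_cocot_prompt(question: str, num_images: int = 2) -> str:
    """Assemble a contrastive chain-of-thought style text prompt.

    Hypothetical wording: the paper's real prompt may differ.
    """
    if num_images < 2:
        raise ValueError("contrastive prompting needs at least two images")

    # Refer to the images by position, e.g. "Image 1 and Image 2".
    labels = [f"Image {i + 1}" for i in range(num_images)]
    listed = ", ".join(labels[:-1]) + " and " + labels[-1]

    steps = [
        # Stage 1: ask the model to contrast the inputs explicitly.
        f"Step 1: Describe the similarities among {listed}.",
        f"Step 2: Describe the differences among {listed}.",
        # Stage 2: answer the question grounded in that comparison.
        "Step 3: Using the similarities and differences identified above, "
        f"answer the question: {question}",
    ]
    return "\n".join(steps)


if __name__ == "__main__":
    print(build_cocot_prompt("Which image matches the caption?"))
```

The returned string would be sent alongside the images in a single multimodal request. Keeping the comparison and the question in one prompt is one possible design; splitting them into two turns is another.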

Results from the Paper


Task: Visual Reasoning. Dataset: Winoground. Global rank is given in parentheses after each metric value.

Model                   Text Score      Image Score     Group Score
GPT-4V + CoCoT          58.5   (#4)     49.5   (#4)     44.5   (#4)
GPT-4V                  54.5   (#6)     42.5   (#10)    37.75  (#10)
Gemini + CoCoT          40     (#36)    32.5   (#19)    27.75  (#16)
Gemini + CCoT           22.5   (#95)    33     (#18)    20.75  (#28)
Gemini + DDCoT          45     (#17)    25     (#34)    23.75  (#19)
Gemini                  30.75  (#62)    26     (#29)    25     (#18)
MMICL + CoCoT           64.25  (#3)     52.5   (#3)     50.75  (#2)
MMICL + CCoT            51     (#9)     48     (#5)     47.5   (#3)
MMICL + DDCoT           46.75  (#12)    45     (#8)     36.75  (#11)
OpenFlamingo + CoCoT    58.25  (#5)     55.25  (#2)     41.5   (#6)
OpenFlamingo + CCoT     42.5   (#27)    27.5   (#24)    20     (#31)
OpenFlamingo + DDCoT    47.5   (#10)    47.25  (#6)     39     (#8)
OpenFlamingo            39     (#39)    41.25  (#13)    33.25  (#12)
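For readers unfamiliar with the three Winoground metrics reported above, the sketch below computes them from per-example scores under the standard Winoground definitions. Each example has two captions (C0, C1) and two images (I0, I1): the text score requires the correct caption to win for each image, the image score requires the correct image to win for each caption, and the group score requires both. The dictionary keys and the function name `winoground_scores` are assumptions for this illustration. How a prompted LMM such as GPT-4V is mapped onto these pairwise scores is not specified on this page.

```python
from typing import Dict, Sequence


def winoground_scores(examples: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Return text, image, and group scores as percentages.

    Each example maps "c0_i0", "c0_i1", "c1_i0", "c1_i1" to a match score
    between caption c and image i (key names are hypothetical).
    """
    if not examples:
        raise ValueError("need at least one example")

    text = image = group = 0
    for s in examples:
        # Text score: for each image, the matching caption scores higher.
        text_ok = s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]
        # Image score: for each caption, the matching image scores higher.
        image_ok = s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]
        text += text_ok
        image += image_ok
        # Group score: both conditions hold for the same example.
        group += text_ok and image_ok

    n = len(examples)
    return {
        "text": 100.0 * text / n,
        "image": 100.0 * image / n,
        "group": 100.0 * group / n,
    }


if __name__ == "__main__":
    demo = [
        {"c0_i0": 0.9, "c0_i1": 0.1, "c1_i0": 0.2, "c1_i1": 0.8},
        {"c0_i0": 0.6, "c0_i1": 0.65, "c1_i0": 0.5, "c1_i1": 0.7},
    ]
    print(winoground_scores(demo))
```

Because the group score needs both conditions on the same example, it can never exceed the text or image score, which is consistent with every row in the table.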
