Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while integrating user-specified textual modifications, thereby capturing user intent more precisely. Existing training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process: they first generate a caption for the reference image and then use Large Language Models for reasoning to obtain a target description. However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR), which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss seen in two-stage methods. Our Reflective Chain-of-Thought framework further improves interpretative accuracy by aligning manipulation intent with contextual cues from reference images. OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks, setting new state-of-the-art results in ZS-CIR and enhancing its utility in vision-language applications. Our code will be available at https://github.com/Pter61/osrcir2024/.

Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Zero-Shot Composed Image Retrieval (ZS-CIR) | CIRCO | OSrCIR (CLIP G/14) | mAP@10 | 31.14 | #10
Zero-Shot Composed Image Retrieval (ZS-CIR) | CIRCO | OSrCIR (CLIP B/32) | mAP@10 | 19.17 | #24
Zero-Shot Composed Image Retrieval (ZS-CIR) | CIRCO | OSrCIR (CLIP L/14) | mAP@10 | 25.33 | #17
Zero-Shot Composed Image Retrieval (ZS-CIR) | CIRR | OSrCIR (CLIP B/32) | R@5 | 54.54 | #33
Zero-Shot Composed Image Retrieval (ZS-CIR) | CIRR | OSrCIR (CLIP G/14) | R@5 | 67.25 | #7
Zero-Shot Composed Image Retrieval (ZS-CIR) | CIRR | OSrCIR (CLIP L/14) | R@5 | 57.68 | #22
Zero-Shot Composed Image Retrieval (ZS-CIR) | Fashion IQ | OSrCIR (CLIP G/14) | (Recall@10+Recall@50)/2 | 47.34 | #7
Zero-Shot Composed Image Retrieval (ZS-CIR) | Fashion IQ | OSrCIR (CLIP L/14) | (Recall@10+Recall@50)/2 | 42.82 | #18
Zero-Shot Composed Image Retrieval (ZS-CIR) | Fashion IQ | OSrCIR (CLIP B/32) | (Recall@10+Recall@50)/2 | 42.87 | #17
Zero-Shot Composed Image Retrieval (ZS-CIR) | GeneCIS | OSrCIR (CLIP B/32) | A-R@1 | 17.4 | #3
Zero-Shot Composed Image Retrieval (ZS-CIR) | GeneCIS | OSrCIR (CLIP L/14) | A-R@1 | 17.9 | #2
Zero-Shot Composed Image Retrieval (ZS-CIR) | GeneCIS | OSrCIR (CLIP G/14) | A-R@1 | 19.6 | #1
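
For readers unfamiliar with the metric names in the results above, the following is a minimal sketch of how Recall@K, the Fashion IQ style average (Recall@10+Recall@50)/2, and AP@K are commonly computed for a single query. It is not taken from the OSrCIR code; the function names are hypothetical, and the AP@K normalization by min(K, number of targets) is an assumption about the benchmark's definition. The functions return fractions in [0, 1], whereas the reported values appear to be percentages averaged over all queries.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], targets: Set[str], k: int) -> float:
    """Return 1.0 if any target image appears in the top-k results, else 0.0."""
    return 1.0 if any(item in targets for item in ranked[:k]) else 0.0


def avg_recall_10_50(ranked: Sequence[str], targets: Set[str]) -> float:
    """Average of Recall@10 and Recall@50 for one query."""
    return (recall_at_k(ranked, targets, 10) + recall_at_k(ranked, targets, 50)) / 2


def average_precision_at_k(ranked: Sequence[str], targets: Set[str], k: int) -> float:
    """AP@K for one query; the mean over queries gives mAP@K.

    Assumption: normalized by min(k, number of targets).
    """
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in targets:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    denom = min(len(targets), k)
    return precision_sum / denom if denom else 0.0


# Toy query: a ranked list of image ids and two ground-truth targets.
ranked_ids = ["a", "b", "c", "d"]
target_ids = {"b", "d"}
print(recall_at_k(ranked_ids, target_ids, 1))             # 0.0
print(recall_at_k(ranked_ids, target_ids, 2))             # 1.0
print(average_precision_at_k(ranked_ids, target_ids, 4))  # 0.5
```

In the toy query, the targets sit at ranks 2 and 4, so the precisions at those ranks are 1/2 and 2/4, and AP@4 is (0.5 + 0.5) / 2 = 0.5.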
