Composed Image Retrieval for Training-Free Domain Conversion

This work addresses composed image retrieval in the context of domain conversion, where the content of a query image is retrieved in the domain specified by the query text. We show that a strong vision-language model provides sufficient descriptive power without additional training. The query image is mapped to the text input space using textual inversion. Unlike the common practice of inverting in the continuous space of text tokens, we use the discrete word space via a nearest-neighbor search in a text vocabulary. With this inversion, the image is softly mapped across the vocabulary and is made more robust using retrieval-based augmentation. Database images are retrieved by a weighted ensemble of text queries combining mapped words with the domain text. Our method outperforms prior art by a large margin on standard and newly introduced benchmarks. Code: https://github.com/NikosEfth/freedom
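
The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a hypothetical toy text encoder (`toy_encode_text`) in place of a real vision-language model, hand-made embeddings, a softmax with an arbitrary temperature for the soft word weights, and a prompt template of the form "a <domain> of a <word>". The retrieval-based augmentation step is omitted.

```python
import numpy as np

# Hypothetical toy vocabulary and domains; a real system would use a large
# text vocabulary and a pretrained vision-language model such as CLIP.
VOCAB = ["dog", "cat", "car"]
DOMAINS = ["photo", "sketch"]


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def toy_encode_text(prompt):
    """Stand-in text encoder (assumption): one slot per known word or domain."""
    tokens = prompt.lower().split()
    vec = np.zeros(len(VOCAB) + len(DOMAINS))
    for i, word in enumerate(VOCAB):
        if word in tokens:
            vec[i] = 1.0
    for j, domain in enumerate(DOMAINS):
        if domain in tokens:
            vec[len(VOCAB) + j] = 1.0
    return vec


def invert_to_words(image_emb, vocab_embs, k=2, temperature=0.1):
    """Discrete textual inversion: k nearest vocabulary words with soft weights."""
    sims = vocab_embs @ image_emb
    top = np.argsort(-sims)[:k]
    weights = np.exp(sims[top] / temperature)  # temperature is an assumed value
    return top, weights / weights.sum()


def compose_query(word_idx, weights, domain):
    """Weighted ensemble of text queries, one per mapped word plus the domain."""
    prompts = [f"a {domain} of a {VOCAB[i]}" for i in word_idx]
    embs = l2_normalize(np.stack([toy_encode_text(p) for p in prompts]))
    return l2_normalize((weights[:, None] * embs).sum(axis=0))


def retrieve(query_emb, db_embs):
    """Rank database items by cosine similarity to the composed query."""
    return np.argsort(-(db_embs @ query_emb))


# Toy data: the query image is a photo that is mostly "dog" and slightly "cat".
vocab_embs = l2_normalize(np.stack([toy_encode_text(w) for w in VOCAB]))
query_image = l2_normalize(np.array([1.0, 0.2, 0.0, 1.0, 0.0]))
database = l2_normalize(np.array([
    [1.0, 0.0, 0.0, 0.0, 1.0],  # 0: sketch of a dog
    [0.0, 1.0, 0.0, 0.0, 1.0],  # 1: sketch of a cat
    [1.0, 0.0, 0.0, 1.0, 0.0],  # 2: photo of a dog
    [0.0, 0.0, 1.0, 0.0, 1.0],  # 3: sketch of a car
]))

word_idx, word_weights = invert_to_words(query_image, vocab_embs)
query = compose_query(word_idx, word_weights, domain="sketch")
ranking = retrieve(query, database)
print([VOCAB[i] for i in word_idx], word_weights, ranking)
```

Because the similarity score is linear in the query embedding, averaging the prompt embeddings with the word weights ranks the database the same way as averaging the per-prompt similarity scores with those weights, so either form of ensemble could be used in this sketch.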

Task: Zero-Shot Composed Image Retrieval (ZS-CIR)

| Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|
| ImageNet-R | Pic2Word (CLIP-L/14) | mAP | 7.88 | # 6 |
| ImageNet-R | CompoDiff (CLIP-L/14) | mAP | 12.88 | # 3 |
| ImageNet-R | WeiCom (CLIP-L/14) | mAP | 10.47 | # 4 |
| ImageNet-R | SEARLE (CLIP-L/14) | mAP | 14.04 | # 2 |
| ImageNet-R | MagicLens (CLIP-L/14) | mAP | 9.13 | # 5 |
| ImageNet-R | FreeDom (CLIP-L/14) | mAP | 29.91 | # 1 |
| Large Time Lags Location (LTLL) | FreeDom (CLIP-L/14) | mAP | 33.24 | # 1 |
| Large Time Lags Location (LTLL) | MagicLens (CLIP-L/14) | mAP | 24.21 | # 4 |
| Large Time Lags Location (LTLL) | SEARLE (CLIP-L/14) | mAP | 25.46 | # 3 |
| Large Time Lags Location (LTLL) | WeiCom (CLIP-L/14) | mAP | 26.6 | # 2 |
| Large Time Lags Location (LTLL) | CompoDiff (CLIP-L/14) | mAP | 21.61 | # 5 |
| Large Time Lags Location (LTLL) | Pic2Word (CLIP-L/14) | mAP | 21.27 | # 6 |
| MiniDomainNet | MagicLens (CLIP-L/14) | mAP | 20.06 | # 4 |
| MiniDomainNet | SEARLE (CLIP-L/14) | mAP | 21.78 | # 3 |
| MiniDomainNet | WeiCom (CLIP-L/14) | mAP | 8.52 | # 6 |
| MiniDomainNet | CompoDiff (CLIP-L/14) | mAP | 22.95 | # 2 |
| MiniDomainNet | Pic2Word (CLIP-L/14) | mAP | 12 | # 5 |
| MiniDomainNet | FreeDom (CLIP-L/14) | mAP | 37.27 | # 1 |
| NICO++ | Pic2Word (CLIP-L/14) | mAP | 9.76 | # 6 |
| NICO++ | CompoDiff (CLIP-L/14) | mAP | 10.32 | # 5 |
| NICO++ | WeiCom (CLIP-L/14) | mAP | 10.54 | # 4 |
| NICO++ | SEARLE (CLIP-L/14) | mAP | 15.13 | # 3 |
| NICO++ | MagicLens (CLIP-L/14) | mAP | 19.66 | # 2 |
| NICO++ | FreeDom (CLIP-L/14) | mAP | 26.1 | # 1 |
