Zero-Shot Composed Image Retrieval (ZS-CIR)

9 papers with code • 7 benchmarks • 7 datasets

Given a query composed of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images that are visually similar to the reference one but incorporate the changes specified in the relative caption. The bi-modality of the query provides users with more precise control over the characteristics of the desired image, as some features are more easily described with language, while others can be better expressed visually.

Zero-Shot Composed Image Retrieval (ZS-CIR) is a subtask of CIR that aims to combine the reference image and the relative caption without supervised training on labeled (reference image, caption, target image) triplets.

Most implemented papers

Zero-Shot Composed Image Retrieval with Textual Inversion

miccunifi/searle ICCV 2023

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images.
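The textual-inversion idea behind this line of work can be sketched as follows: a mapping network projects the image embedding into the text encoder's token-embedding space, and the resulting pseudo-word is substituted into a caption template before text encoding. Everything here is a hedged toy stand-in (a fixed random projection instead of SEARLE's learned network, a hash-based word lookup and mean pooling instead of a real text encoder), meant only to show where the pseudo-word enters the pipeline.

```python
import zlib
import numpy as np

DIM = 8
rng = np.random.default_rng(1)
# Stand-in for the learned image-to-token mapping network: a fixed projection.
PHI = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)

def image_to_pseudo_word(image_emb):
    """Map an image embedding into the text token-embedding space."""
    return PHI @ image_emb

def encode_token(tok):
    """Toy word-embedding lookup via a deterministic hash (stand-in only)."""
    seed = zlib.crc32(tok.encode())
    return np.random.default_rng(seed).normal(size=DIM)

def encode_caption_with_pseudo_word(template, pseudo):
    """Encode a caption template, substituting '$' with the pseudo-word.
    A real text encoder would run a transformer over the token sequence;
    here we simply mean-pool the token embeddings."""
    vecs = [pseudo if t == "$" else encode_token(t) for t in template.split()]
    return (v := np.mean(vecs, axis=0)) / np.linalg.norm(v)

image_emb = rng.normal(size=DIM)
query = encode_caption_with_pseudo_word(
    "a photo of $ that is black", image_to_pseudo_word(image_emb))
```

Because the pseudo-word lives in the text embedding space, the relative caption and the reference image are composed by the text encoder itself, with no triplet supervision.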

"This is my unicorn, Fluffy": Personalizing frozen vision-language representations

nvlabs/palavra 4 Apr 2022

We propose an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts.

Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval

google-research/composed_image_retrieval CVPR 2023

Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image.

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

navervision/compodiff 21 Mar 2023

This paper proposes CompoDiff, a novel latent-diffusion model for solving zero-shot Composed Image Retrieval (ZS-CIR).

Zero-shot Composed Text-Image Retrieval

Code-kunkun/ZS-CIR 12 Jun 2023

In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expressive ability.

CoVR: Learning Composed Video Retrieval from Web Video Captions

lucas-ventura/CoVR 28 Aug 2023

Most composed image retrieval (CoIR) approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image.

Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval

pter61/context-i2w 28 Sep 2023

Unlike the Composed Image Retrieval task, which requires expensive labels for training task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intents related to domain, scene, object, and attribute.

Vision-by-Language for Training-Free Compositional Image Retrieval

explainableml/vision_by_language 13 Oct 2023

Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc.

Language-only Efficient Training of Zero-shot Composed Image Retrieval

navervision/lincir 4 Dec 2023

Our LinCIR (Language-only training for CIR) can be trained only with text datasets by a novel self-supervision named self-masking projection (SMP).