Zero-Shot Composed Image Retrieval (ZS-CIR)
9 papers with code • 7 benchmarks • 7 datasets
Given a query composed of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images that are visually similar to the reference one but incorporate the changes specified in the relative caption. The bi-modality of the query provides users with more precise control over the characteristics of the desired image, as some features are more easily described with language, while others can be better expressed visually.
Zero-Shot Composed Image Retrieval (ZS-CIR) is a variant of CIR that aims to combine the reference image and the relative caption without supervised training on labeled CIR triplets.
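The retrieval step common to these methods can be illustrated with a minimal, training-free sketch: embed the reference image and the relative caption, fuse the two vectors, and rank the gallery by cosine similarity. The random vectors below are placeholders for CLIP-like embeddings, and the weighted-sum fusion is a simple baseline, not any specific paper's method.

```python
import numpy as np

def cosine_sim(query, gallery):
    # Cosine similarity between one query vector and each gallery row.
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return g @ q

def compose_query(img_emb, txt_emb, alpha=0.5):
    # Training-free fusion: weighted sum of the reference-image and
    # relative-caption embeddings, renormalized to unit length.
    q = alpha * img_emb + (1 - alpha) * txt_emb
    return q / np.linalg.norm(q)

rng = np.random.default_rng(0)
dim = 8
img_emb = rng.normal(size=dim)        # placeholder image embedding
txt_emb = rng.normal(size=dim)        # placeholder caption embedding
gallery = rng.normal(size=(5, dim))   # placeholder target-image embeddings

query = compose_query(img_emb, txt_emb)
ranking = np.argsort(-cosine_sim(query, gallery))  # best match first
print(ranking)
```

In practice the embeddings come from a frozen vision-language model, and the fusion step is where the surveyed methods differ.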
Most implemented papers
Zero-Shot Composed Image Retrieval with Textual Inversion
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images.
"This is my unicorn, Fluffy": Personalizing frozen vision-language representations
We propose an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts.
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image.
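The textual-inversion family of methods (Pic2Word, SEARLE, Context-I2W) avoids labeled triplets by mapping the image embedding into the text token-embedding space as a pseudo-word, which is then inserted into the caption and processed by the text encoder. A toy sketch of that idea, with a random matrix standing in for the learned projection and mean-pooling standing in for a real text encoder:

```python
import numpy as np

rng = np.random.default_rng(1)
img_dim, tok_dim = 8, 6

# Hypothetical learned projection: maps a frozen image embedding into the
# text token-embedding space as a single pseudo-word token.
W = rng.normal(size=(tok_dim, img_dim))

def image_to_pseudo_token(img_emb):
    return W @ img_emb

def compose_prompt(pseudo_tok, caption_tok_embs):
    # Stand-in for a text encoder: mean-pool the caption's token
    # embeddings together with the pseudo-word token.
    tokens = np.vstack([pseudo_tok, caption_tok_embs])
    return tokens.mean(axis=0)

img_emb = rng.normal(size=img_dim)
caption_toks = rng.normal(size=(4, tok_dim))  # tokens of the relative caption

pseudo = image_to_pseudo_token(img_emb)
query = compose_prompt(pseudo, caption_toks)
print(query.shape)
```

In the actual methods, the projection is trained on image-caption pairs (or text alone, as in LinCIR) while the backbone stays frozen.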
CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion
This paper proposes a novel diffusion-based model, CompoDiff, for solving zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion.
Zero-shot Composed Text-Image Retrieval
In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's ability to express search intent.
CoVR: Learning Composed Video Retrieval from Web Video Captions
Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image.
Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval
Different from Composed Image Retrieval task that requires expensive labels for training task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent that could be related to domain, scene, object, and attribute.
Vision-by-Language for Training-Free Compositional Image Retrieval
Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc.
Language-only Efficient Training of Zero-shot Composed Image Retrieval
Our LinCIR (Language-only training for CIR) can be trained only with text datasets by a novel self-supervision named self-masking projection (SMP).