Zero-shot Image Retrieval
16 papers with code • 5 benchmarks • 6 datasets
Most implemented papers
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model.
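A minimal sketch of the teacher-student idea suggested by the title: a multilingual text encoder (XLM-R here, as an assumption) is distilled toward a frozen CLIP text encoder on parallel captions, so CLIP's image tower can be reused unchanged. The model names, projection head, and plain MSE objective below are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
from transformers import (AutoModel, AutoTokenizer, CLIPTextModelWithProjection,
                          CLIPTokenizer)

# Frozen teacher: CLIP's original text tower (projection included).
teacher_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
teacher = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32").eval()

# Trainable student: a multilingual encoder plus a linear head into CLIP's text space.
student_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
student = AutoModel.from_pretrained("xlm-roberta-base")
proj = torch.nn.Linear(student.config.hidden_size, teacher.config.projection_dim)

opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-5)

def distill_step(en_caption: str, zh_caption: str) -> float:
    """One distillation step on a parallel (English, Chinese) caption pair."""
    with torch.no_grad():  # teacher stays frozen
        t_in = teacher_tok(en_caption, return_tensors="pt")
        t_emb = teacher(**t_in).text_embeds          # (1, projection_dim)
    s_in = student_tok(zh_caption, return_tensors="pt")
    s_hidden = student(**s_in).last_hidden_state[:, 0]  # <s> token state
    s_emb = proj(s_hidden)
    loss = F.mse_loss(s_emb, t_emb)  # pull the student into CLIP's text space
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```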
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
Existing methods rely on supervised learning of composed image retrieval (CIR) models using labeled triplets consisting of the query image, the text specification, and the target image.
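A minimal sketch of the Pic2Word idea: a small mapping network turns a CLIP image embedding into a single pseudo-word token embedding, trained with a contrastive objective so that encoding a prompt containing the pseudo-word reproduces the image embedding; at test time the pseudo-word is spliced into an arbitrary relative caption for zero-shot composed retrieval. The class and loss below are an illustrative reconstruction, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pic2WordMapper(nn.Module):
    """Maps a CLIP image embedding to one token embedding (the pseudo-word)."""
    def __init__(self, embed_dim: int = 512, token_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, token_dim),
        )

    def forward(self, img_emb: torch.Tensor) -> torch.Tensor:
        return self.net(img_emb)

def cycle_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between image embeddings and the text embeddings of
    prompts ("a photo of [pseudo-word]") built from the matching images."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(len(img), device=img.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```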
FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing
Textual scene graph parsing has become increasingly important in various vision-language applications, including image caption evaluation and image retrieval.
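To make the task concrete, here is an illustrative example of what a textual scene graph parser produces: a caption decomposed into relation and attribute facts. The caption and tuples are hand-written to show the target format, not drawn from the FACTUAL dataset.

```python
caption = "A young girl is riding a brown horse on the beach."

# (subject, predicate, object) facts; attributes expressed via "is".
scene_graph = [
    ("girl", "is", "young"),      # attribute fact
    ("girl", "riding", "horse"),  # relation fact
    ("horse", "is", "brown"),     # attribute fact
    ("horse", "on", "beach"),     # relation fact
]
```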
Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval
Different from the Composed Image Retrieval (CIR) task, which requires expensive labeled triplets to train task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) covers diverse tasks with a broad range of visual content manipulation intents that may relate to domain, scene, object, and attribute.
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence.