Zero-shot Image Retrieval
16 papers with code • 5 benchmarks • 6 datasets
Most implemented papers
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model.
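A minimal sketch of the teacher-student idea suggested by the title: a multilingual text encoder (XLM-R here, as an assumption) is distilled toward a frozen CLIP text encoder on parallel captions, so CLIP's image tower can be reused unchanged. The model names, projection head, and plain MSE objective below are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
from transformers import (AutoModel, AutoTokenizer, CLIPTextModelWithProjection,
                          CLIPTokenizer)

# Frozen teacher: CLIP's original text tower (projection included).
teacher_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
teacher = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32").eval()

# Trainable student: a multilingual encoder plus a linear head into CLIP's text space.
student_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
student = AutoModel.from_pretrained("xlm-roberta-base")
proj = torch.nn.Linear(student.config.hidden_size, teacher.config.projection_dim)

opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-5)

def distill_step(en_caption: str, zh_caption: str) -> float:
    """One distillation step on a parallel (English, Chinese) caption pair."""
    with torch.no_grad():  # teacher stays frozen
        t_in = teacher_tok(en_caption, return_tensors="pt")
        t_emb = teacher(**t_in).text_embeds          # (1, projection_dim)
    s_in = student_tok(zh_caption, return_tensors="pt")
    s_hidden = student(**s_in).last_hidden_state[:, 0]  # <s> token state
    s_emb = proj(s_hidden)
    loss = F.mse_loss(s_emb, t_emb)  # pull the student into CLIP's text space
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```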
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
Existing methods rely on supervised learning of composed image retrieval (CIR) models using labeled triplets consisting of the query image, the text specification, and the target image.
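A minimal sketch of the Pic2Word idea: a small mapping network turns a CLIP image embedding into a single pseudo-word token embedding, trained with a contrastive objective so that encoding a prompt containing the pseudo-word reproduces the image embedding; at test time the pseudo-word is spliced into an arbitrary relative caption for zero-shot composed retrieval. The class and loss below are an illustrative reconstruction, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pic2WordMapper(nn.Module):
    """Maps a CLIP image embedding to one token embedding (the pseudo-word)."""
    def __init__(self, embed_dim: int = 512, token_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, token_dim),
        )

    def forward(self, img_emb: torch.Tensor) -> torch.Tensor:
        return self.net(img_emb)

def cycle_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between image embeddings and the text embeddings of
    prompts ("a photo of [pseudo-word]") built from the matching images."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(len(img), device=img.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```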
FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing
Textual scene graph parsing has become increasingly important in various vision-language applications, including image caption evaluation and image retrieval.
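To make the task concrete, here is an illustrative example of what a textual scene graph parser produces: a caption decomposed into relation and attribute facts. The caption and tuples are hand-written to show the target format, not drawn from the FACTUAL dataset.

```python
caption = "A young girl is riding a brown horse on the beach."

# (subject, predicate, object) facts; attributes expressed via "is".
scene_graph = [
    ("girl", "is", "young"),      # attribute fact
    ("girl", "riding", "horse"),  # relation fact
    ("horse", "is", "brown"),     # attribute fact
    ("horse", "on", "beach"),     # relation fact
]
```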
Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval
Different from the Composed Image Retrieval (CIR) task, which requires expensive labeled triplets to train task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) covers diverse tasks with a broad range of visual content manipulation intents that may relate to domain, scene, object, and attribute.
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence.