Zero-Shot Cross-Modal Retrieval

22 papers with code • 3 benchmarks • 5 datasets

Zero-Shot Cross-Modal Retrieval is the task of finding relevant items across different modalities without having seen any training examples for the target retrieval task: given an image, retrieve the matching text, or vice versa. The task poses a particular challenge known as the heterogeneity gap, which arises because items from different modalities (such as text and images) have inherently different data types and feature distributions, so their similarity cannot be measured directly. To address this, most current approaches bridge the heterogeneity gap by learning a shared latent representation space: data from different modalities are projected into a common representation in which similarity between items can be measured directly, regardless of modality, as sketched below.

Source: Extending CLIP for Category-to-image Retrieval in E-commerce
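
As a rough illustration of this shared-space approach, the sketch below projects an image query and a set of candidate captions into one embedding space and ranks the captions by cosine similarity. The two encoder functions are placeholder random projections standing in for pretrained encoders (e.g., CLIP's), so only the retrieval logic is meaningful.

# Minimal sketch of zero-shot cross-modal retrieval in a shared embedding space.
# The encoders below are placeholders (random projections) standing in for
# pretrained image/text encoders; only the ranking logic matters here.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512

def encode_image(image: np.ndarray) -> np.ndarray:
    # Placeholder: a real system would run a pretrained vision encoder.
    return rng.standard_normal(EMBED_DIM)

def encode_text(caption: str) -> np.ndarray:
    # Placeholder: a real system would run a pretrained text encoder.
    return rng.standard_normal(EMBED_DIM)

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Retrieval: given one image, rank candidate captions by cosine similarity.
query = normalize(encode_image(np.zeros((224, 224, 3))))
captions = ["a dog on the beach", "a red sports car", "two people hiking"]
gallery = np.stack([normalize(encode_text(c)) for c in captions])
scores = gallery @ query                 # cosine similarity (unit vectors)
ranking = np.argsort(-scores)            # best match first
print([captions[i] for i in ranking])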

Most implemented papers

Learning Transferable Visual Models From Natural Language Supervision

openai/CLIP 26 Feb 2021

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.
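
For a concrete zero-shot retrieval example, the sketch below uses the openai/CLIP package's published interface (clip.load, clip.tokenize, encode_image, encode_text); the ViT-B/32 checkpoint, the image path, and the candidate captions are illustrative choices, not part of the paper.

# Zero-shot image-to-text retrieval with openai/CLIP (ViT-B/32 assumed available).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)   # path is illustrative
texts = clip.tokenize(["a dog on the beach", "a red sports car", "two people hiking"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Normalize so the dot product equals cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

best = similarity.argmax().item()
print("best caption index:", best)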

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
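
The conditional-masking idea can be illustrated with a simplified sketch: in each pre-training step only one modality is corrupted while the other stays fully observed. The function below is a toy approximation for illustration, not UNITER's actual implementation, and the masking probability is an assumed value.

# Illustrative sketch of conditional masking: corrupt tokens in only one
# modality per step, keeping the other modality fully observed.
import random

MASK_PROB = 0.15          # assumed masking rate
MASK_TOKEN = "[MASK]"

def conditional_mask(text_tokens, image_regions):
    # Choose exactly one modality to corrupt this step.
    if random.random() < 0.5:
        masked_text = [MASK_TOKEN if random.random() < MASK_PROB else t
                       for t in text_tokens]
        return masked_text, image_regions        # text masked, image fully observed
    else:
        masked_regions = [None if random.random() < MASK_PROB else r
                          for r in image_regions]
        return text_tokens, masked_regions       # image masked, text fully observed

text = ["a", "dog", "on", "the", "beach"]
regions = ["region_0", "region_1", "region_2"]
print(conditional_mask(text, regions))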

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

dandelin/vilt 5 Feb 2021

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

salesforce/lavis NeurIPS 2021

Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.

Flamingo: a Visual Language Model for Few-Shot Learning

mlfoundations/open_flamingo NeurIPS 2022

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.

CoCa: Contrastive Captioners are Image-Text Foundation Models

mlfoundations/open_clip 4 May 2022

We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the outputs of the multimodal decoder, which predicts text tokens autoregressively.
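
Those two objectives can be sketched in a few lines of PyTorch: a symmetric contrastive loss over unimodal image/text embeddings plus a token-level cross-entropy captioning loss on decoder outputs. The tensor shapes, temperature, and equal loss weighting below are illustrative assumptions rather than CoCa's exact configuration.

# Sketch of CoCa-style training losses (illustrative shapes, not the paper's exact setup).
import torch
import torch.nn.functional as F

batch, dim, vocab, seq_len = 8, 512, 32000, 16
temperature = 0.07                                    # assumed value

# Stand-ins for unimodal embeddings from the image encoder and text decoder.
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)

# Contrastive loss: matched image-text pairs lie on the diagonal of the logit matrix.
logits = image_emb @ text_emb.T / temperature
targets = torch.arange(batch)
contrastive_loss = (F.cross_entropy(logits, targets) +
                    F.cross_entropy(logits.T, targets)) / 2

# Captioning loss: the multimodal decoder predicts text tokens autoregressively.
decoder_logits = torch.randn(batch, seq_len, vocab)   # stand-in for decoder outputs
caption_tokens = torch.randint(0, vocab, (batch, seq_len))
captioning_loss = F.cross_entropy(decoder_logits.reshape(-1, vocab),
                                  caption_tokens.reshape(-1))

total_loss = contrastive_loss + captioning_loss       # loss weights omitted here
print(float(total_loss))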

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

facebookresearch/metaclip 11 Feb 2021

In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used in the Conceptual Captions dataset.

Reproducible scaling laws for contrastive language-image learning

laion-ai/scaling-laws-openclip CVPR 2023

To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository.

Florence: A New Foundation Model for Computer Vision

microsoft/unicl 22 Nov 2021

Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for solving real-world computer vision applications.

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

microsoft/unilm 22 Aug 2022

A big convergence of language, vision, and multimodal pretraining is emerging.