Zero-Shot Cross-Modal Retrieval

18 papers with code • 2 benchmarks • 4 datasets

Zero-Shot Cross-Modal Retrieval is the task of finding relevant items across different modalities without having been trained on any examples of those items. For example, given an image, retrieve a relevant text, or vice versa. The main challenge is known as the heterogeneity gap: because items from different modalities have different data types, the similarity between them cannot be measured directly. Most methods published to date therefore bridge this gap by learning a latent representation space in which the similarity between items from different modalities can be measured.

Source: Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
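The shared-latent-space idea above can be sketched in a few lines: once both modalities are embedded into the same space by their respective encoders, retrieval reduces to ranking by cosine similarity. This is a minimal illustration with toy embeddings standing in for encoder outputs, not the pipeline of any specific paper.

```python
import numpy as np

def normalize(x):
    # L2-normalize each row so dot products equal cosine similarity
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def retrieve(query_emb, gallery_embs, k=1):
    # Rank items from the other modality by cosine similarity to the query
    sims = normalize(gallery_embs) @ normalize(query_emb.reshape(1, -1)).T
    return np.argsort(-sims.ravel())[:k]

# Toy 2-D embeddings standing in for text-encoder outputs in the shared space
text_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
# A toy image embedding used as the query
image_query = np.array([0.9, 0.1])

print(retrieve(image_query, text_embs, k=2))  # indices of the nearest texts
```

In practice the encoders (e.g. CLIP's image and text towers) are trained so that matched image-text pairs land close together in this space; at retrieval time no task-specific training examples are needed.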



Most implemented papers

Learning Transferable Visual Models From Natural Language Supervision

openai/CLIP 26 Feb 2021

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

salesforce/lavis NeurIPS 2021

Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.

CoCa: Contrastive Captioners are Image-Text Foundation Models

mlfoundations/open_clip 4 May 2022

We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively.
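The contrastive loss mentioned here (used in CLIP, ALIGN, and CoCa alike) can be sketched as a symmetric InfoNCE objective: within a batch, matched image-text pairs are positives and all other pairings are negatives. The snippet below is an illustrative numpy sketch, not any paper's exact implementation; the `temperature` value is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    # Symmetric InfoNCE: the i-th image and i-th text form a positive pair;
    # every other pairing in the batch acts as a negative.
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (n, n) similarity matrix
    n = logits.shape[0]
    idx = np.arange(n)
    loss_i2t = -np.log(softmax(logits, axis=1)[idx, idx]).mean()
    loss_t2i = -np.log(softmax(logits, axis=0)[idx, idx]).mean()
    return (loss_i2t + loss_t2i) / 2
```

When the two embeddings of each pair align and all cross-pair similarities are low, the loss approaches zero; CoCa adds an autoregressive captioning loss on top of this objective.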

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

dandelin/vilt 5 Feb 2021

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

kakaobrain/coyo-dataset 11 Feb 2021

In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset.

Flamingo: a Visual Language Model for Few-Shot Learning

mlfoundations/open_flamingo NeurIPS 2022

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.

Reproducible scaling laws for contrastive language-image learning

laion-ai/scaling-laws-openclip CVPR 2023

To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository.

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

google-research/google-research CVPR 2023

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.

Florence: A New Foundation Model for Computer Vision

microsoft/unicl 22 Nov 2021

Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for solving real-world computer vision applications.