Zero-Shot Cross-Modal Retrieval

22 papers with code • 3 benchmarks • 5 datasets

Zero-Shot Cross-Modal Retrieval is the task of finding relevant items across different modalities without having seen any training examples for the target retrieval task: given an image, retrieve the matching text, or vice versa. The task poses a particular challenge known as the heterogeneity gap, which arises because items from different modalities (such as text and images) have inherently different data types and feature distributions, so their similarity cannot be measured directly. To address this, most current approaches bridge the heterogeneity gap by learning a shared latent representation space: data from different modalities are projected into a common representation in which similarity between items can be measured directly, regardless of modality, as sketched below.

Source: Extending CLIP for Category-to-image Retrieval in E-commerce
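
As a rough illustration of this shared-space approach, the sketch below projects an image query and a set of candidate captions into one embedding space and ranks the captions by cosine similarity. The two encoder functions are placeholder random projections standing in for pretrained encoders (e.g., CLIP's), so only the retrieval logic is meaningful.

# Minimal sketch of zero-shot cross-modal retrieval in a shared embedding space.
# The encoders below are placeholders (random projections) standing in for
# pretrained image/text encoders; only the ranking logic matters here.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512

def encode_image(image: np.ndarray) -> np.ndarray:
    # Placeholder: a real system would run a pretrained vision encoder.
    return rng.standard_normal(EMBED_DIM)

def encode_text(caption: str) -> np.ndarray:
    # Placeholder: a real system would run a pretrained text encoder.
    return rng.standard_normal(EMBED_DIM)

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Retrieval: given one image, rank candidate captions by cosine similarity.
query = normalize(encode_image(np.zeros((224, 224, 3))))
captions = ["a dog on the beach", "a red sports car", "two people hiking"]
gallery = np.stack([normalize(encode_text(c)) for c in captions])
scores = gallery @ query                 # cosine similarity (unit vectors)
ranking = np.argsort(-scores)            # best match first
print([captions[i] for i in ranking])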

Most implemented papers

Learning Transferable Visual Models From Natural Language Supervision

openai/CLIP 26 Feb 2021

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.
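
For a concrete zero-shot retrieval example, the sketch below uses the openai/CLIP package's published interface (clip.load, clip.tokenize, encode_image, encode_text); the ViT-B/32 checkpoint, the image path, and the candidate captions are illustrative choices, not part of the paper.

# Zero-shot image-to-text retrieval with openai/CLIP (ViT-B/32 assumed available).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)   # path is illustrative
texts = clip.tokenize(["a dog on the beach", "a red sports car", "two people hiking"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Normalize so the dot product equals cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

best = similarity.argmax().item()
print("best caption index:", best)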

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
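
The conditional-masking idea can be illustrated with a simplified sketch: in each pre-training step only one modality is corrupted while the other stays fully observed. The function below is a toy approximation for illustration, not UNITER's actual implementation, and the masking probability is an assumed value.

# Illustrative sketch of conditional masking: corrupt tokens in only one
# modality per step, keeping the other modality fully observed.
import random

MASK_PROB = 0.15          # assumed masking rate
MASK_TOKEN = "[MASK]"

def conditional_mask(text_tokens, image_regions):
    # Choose exactly one modality to corrupt this step.
    if random.random() < 0.5:
        masked_text = [MASK_TOKEN if random.random() < MASK_PROB else t
                       for t in text_tokens]
        return masked_text, image_regions        # text masked, image fully observed
    else:
        masked_regions = [None if random.random() < MASK_PROB else r
                          for r in image_regions]
        return text_tokens, masked_regions       # image masked, text fully observed

text = ["a", "dog", "on", "the", "beach"]
regions = ["region_0", "region_1", "region_2"]
print(conditional_mask(text, regions))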

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

dandelin/vilt 5 Feb 2021

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

salesforce/lavis NeurIPS 2021

Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.

Flamingo: a Visual Language Model for Few-Shot Learning

mlfoundations/open_flamingo NeurIPS 2022

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.

CoCa: Contrastive Captioners are Image-Text Foundation Models

mlfoundations/open_clip 4 May 2022

We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the outputs of the multimodal decoder, which predicts text tokens autoregressively.
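
Those two objectives can be sketched in a few lines of PyTorch: a symmetric contrastive loss over unimodal image/text embeddings plus a token-level cross-entropy captioning loss on decoder outputs. The tensor shapes, temperature, and equal loss weighting below are illustrative assumptions rather than CoCa's exact configuration.

# Sketch of CoCa-style training losses (illustrative shapes, not the paper's exact setup).
import torch
import torch.nn.functional as F

batch, dim, vocab, seq_len = 8, 512, 32000, 16
temperature = 0.07                                    # assumed value

# Stand-ins for unimodal embeddings from the image encoder and text decoder.
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)

# Contrastive loss: matched image-text pairs lie on the diagonal of the logit matrix.
logits = image_emb @ text_emb.T / temperature
targets = torch.arange(batch)
contrastive_loss = (F.cross_entropy(logits, targets) +
                    F.cross_entropy(logits.T, targets)) / 2

# Captioning loss: the multimodal decoder predicts text tokens autoregressively.
decoder_logits = torch.randn(batch, seq_len, vocab)   # stand-in for decoder outputs
caption_tokens = torch.randint(0, vocab, (batch, seq_len))
captioning_loss = F.cross_entropy(decoder_logits.reshape(-1, vocab),
                                  caption_tokens.reshape(-1))

total_loss = contrastive_loss + captioning_loss       # loss weights omitted here
print(float(total_loss))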

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

facebookresearch/metaclip 11 Feb 2021

In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used in the Conceptual Captions dataset.

Reproducible scaling laws for contrastive language-image learning

laion-ai/scaling-laws-openclip CVPR 2023

To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository.

Florence: A New Foundation Model for Computer Vision

microsoft/unicl 22 Nov 2021

Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for solving real-world computer vision applications.

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

microsoft/unilm 22 Aug 2022

A big convergence of language, vision, and multimodal pretraining is emerging.