Text-to-Image Retrieval

9 papers with code • 5 benchmarks • 3 datasets

Text-to-Image Retrieval is the task of retrieving the most relevant images from a collection given a natural-language query, typically by embedding both modalities into a shared space and ranking candidate images by their similarity to the query.
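
A minimal sketch of this embed-and-rank setup is shown below; the linear encoders and random features are placeholders standing in for real pretrained text and image encoders (e.g., a CLIP-style dual encoder):

```python
import torch
import torch.nn.functional as F

# Placeholder encoders; in practice these would be pretrained text/image networks.
text_encoder = torch.nn.Linear(300, 128)    # stand-in for a text encoder
image_encoder = torch.nn.Linear(2048, 128)  # stand-in for an image encoder

query_feat = torch.randn(1, 300)         # one text query (placeholder features)
gallery_feats = torch.randn(1000, 2048)  # 1000 candidate images (placeholder features)

q = F.normalize(text_encoder(query_feat), dim=-1)
g = F.normalize(image_encoder(gallery_feats), dim=-1)

# Rank gallery images by cosine similarity to the query.
scores = q @ g.t()                     # (1, 1000)
topk = scores.topk(5, dim=-1).indices  # indices of the 5 best-matching images
print(topk)
```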

Most implemented papers

ZSCRGAN: A GAN-based Expectation Maximization Model for Zero-Shot Retrieval of Images from Textual Descriptions

ranarag/ZSCRGAN 23 Jul 2020

Most existing algorithms for cross-modal information retrieval are based on a supervised train-test setup, where a model learns to align the mode of the query (e.g., text) to the mode of the documents (e.g., images) from a given training set.
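
To illustrate the supervised alignment setup this excerpt contrasts against (not ZSCRGAN's GAN/EM procedure itself), a symmetric InfoNCE-style objective over paired text and image embeddings might look like the following sketch; the batch size and temperature are arbitrary:

```python
import torch
import torch.nn.functional as F

def alignment_loss(text_emb, image_emb, temperature=0.07):
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(t.size(0))   # i-th text matches i-th image
    # Symmetric cross-entropy: text-to-image and image-to-text directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = alignment_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```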

WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

BAAI-WuDao/BriVl 11 Mar 2021

We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model.

A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval

m2man/LGSGM 4 Jun 2021

In this paper, we introduce the Local and Global Scene Graph Matching (LGSGM) model that enhances the state-of-the-art method by integrating an extra graph convolution network to capture the general information of a graph.
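
A rough sketch of one graph-convolution step over scene-graph node features, meant only to illustrate the "extra graph convolution network" idea; LGSGM's actual layers, pooling, and matching head differ:

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # Row-normalize the adjacency (with self-loops) and aggregate neighbors.
        adj = adj + torch.eye(adj.size(0))
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        agg = (adj / deg) @ node_feats
        return torch.relu(self.linear(agg))

nodes = torch.randn(6, 64)                   # 6 scene-graph nodes (placeholder features)
adj = torch.randint(0, 2, (6, 6)).float()    # placeholder adjacency
graph_summary = SimpleGraphConv(64, 64)(nodes, adj).mean(dim=0)  # pooled graph feature
print(graph_summary.shape)
```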

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

mindspore-ai/models 1 Jul 2021

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, textual, and audio resources.
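
A hedged sketch of what joint modeling of the three modalities can look like: project visual, text, and audio features to a shared width and run them through a single transformer encoder. The dimensions and architecture below are illustrative assumptions, not OPT's actual design:

```python
import torch
import torch.nn as nn

d = 128
proj_v, proj_t, proj_a = nn.Linear(2048, d), nn.Linear(768, d), nn.Linear(512, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2)

vis = torch.randn(2, 49, 2048)   # image patch features (placeholder)
txt = torch.randn(2, 16, 768)    # text token features (placeholder)
aud = torch.randn(2, 32, 512)    # audio frame features (placeholder)

# Concatenate projected modality tokens and contextualize them jointly.
tokens = torch.cat([proj_v(vis), proj_t(txt), proj_a(aud)], dim=1)
joint = encoder(tokens)          # (2, 49+16+32, 128) jointly modeled sequence
print(joint.shape)
```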

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

zinengtang/VidLanKD NeurIPS 2021

We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
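
A generic soft-label distillation step is sketched below to illustrate the teacher-to-student transfer; VidLanKD's actual objectives operate on hidden representations rather than plain logit matching, so this is only an illustration of the setup:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    # Student matches the (frozen) teacher's softened output distribution.
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

student_logits = torch.randn(4, 30522)       # student LM predictions (placeholder)
with torch.no_grad():
    teacher_logits = torch.randn(4, 30522)   # frozen video-text teacher predictions (placeholder)
print(distill_loss(student_logits, teacher_logits).item())
```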

Semantically Self-Aligned Network for Text-to-Image Part-aware Person Re-identification

zifyloo/SSAN 27 Jul 2021

We introduce a Compound Ranking (CR) loss that makes use of textual descriptions for other images of the same identity to provide extra supervision, thereby effectively reducing the intra-class variance in textual features.
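
A hedged sketch of the idea: captions of other images with the same identity serve as additional, weaker positives in a margin-based ranking objective. The margins and weighting below are illustrative assumptions, not the exact formulation or values used in SSAN:

```python
import torch
import torch.nn.functional as F

def compound_ranking(img, own_txt, same_id_txt, other_txt, m1=0.3, m2=0.2, w=0.5):
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1)
    # Standard ranking term: the image's own caption should beat captions of other identities.
    strong = F.relu(m1 - sim(img, own_txt) + sim(img, other_txt))
    # Extra supervision: same-identity captions of other images should also rank above them.
    weak = F.relu(m2 - sim(img, same_id_txt) + sim(img, other_txt))
    return (strong + w * weak).mean()

f = lambda: torch.randn(8, 256)  # placeholder image/text embeddings
print(compound_ranking(f(), f(), f(), f()).item())
```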

Cross-Modal Coherence for Text-to-Image Retrieval

klory/cross-modal-coherence-for-text-to-image-retrieval 22 Sep 2021

Common image-text joint understanding techniques presume that images and the associated text can universally be characterized by a single implicit model.

One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code

fla-sil/PyTorrent 12 May 2022

Moreover, our model supports self-supervised pretraining in the same sparsely activated manner, resulting in better-initialized parameters for different modalities.
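
A minimal sketch of sparse activation: a router selects one expert sub-module per input, so only part of the network runs for each example. The top-1 routing over generic linear experts shown here is an assumption for illustration, not the paper's actual module-selection scheme:

```python
import torch
import torch.nn as nn

class SparselyActivated(nn.Module):
    def __init__(self, dim=128, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):
        expert_id = self.router(x).argmax(dim=-1)  # top-1 routing per example
        out = torch.empty_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_id == i
            if mask.any():
                out[mask] = expert(x[mask])        # only the selected experts compute
        return out

print(SparselyActivated()(torch.randn(6, 128)).shape)
```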

Fine-grained Image Captioning with CLIP Reward

j-min/clip-caption-reward Findings (NAACL) 2022

Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on a huge set of image-text pairs from the web, to compute multimodal similarity and use it as a reward function.
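
A sketch of the reward computation: normalized image and caption embeddings are compared by cosine similarity, and the resulting scores can then drive a policy-gradient update of the captioner. The placeholder embeddings below stand in for CLIP's actual text and image towers:

```python
import torch
import torch.nn.functional as F

def clip_style_reward(image_emb, caption_embs):
    # Higher cosine similarity to the image means a larger reward for that caption.
    img = F.normalize(image_emb, dim=-1)
    cap = F.normalize(caption_embs, dim=-1)
    return (cap @ img.t()).squeeze(-1)   # one similarity score per sampled caption

image_emb = torch.randn(1, 512)          # CLIP-style image embedding (placeholder)
caption_embs = torch.randn(5, 512)       # embeddings of 5 sampled captions (placeholder)
rewards = clip_style_reward(image_emb, caption_embs)
print(rewards)
```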