Image-to-Text Retrieval

36 papers with code • 8 benchmarks • 8 datasets

Image-text retrieval is the task of retrieving relevant images given a textual description, or of finding the textual description that corresponds to a given image. It is interdisciplinary, combining techniques from computer vision and natural language processing. The primary challenge lies in bridging the semantic gap: the difference between how visual content is represented in images and how humans describe that content in language. To address this, many methods learn a shared embedding space in which both images and text are represented comparably, so that their similarity can be measured directly and retrieval becomes more accurate.

Source: Extending CLIP for Category-to-Image Retrieval in E-commerce
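
To make the shared-embedding idea concrete, here is a minimal sketch (not tied to any specific paper above) of the retrieval step itself: given embeddings already produced by some image and text encoders, candidates are ranked by cosine similarity to the query. The function and variable names are illustrative only.

```python
import numpy as np

def retrieve_top_k(query_emb, candidate_embs, k=5):
    """Rank candidates by cosine similarity to the query in a shared embedding space.

    query_emb:      (D,) embedding of the query (an image or a caption)
    candidate_embs: (N, D) embeddings of items on the other modality (captions or images)
    Returns the indices of the k most similar candidates, best first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity, since both sides are unit-normalized
    return np.argsort(-scores)[:k]
```

Either direction of the task (image-to-text or text-to-image) reduces to this ranking step once both modalities live in the same space.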

Most implemented papers

Learning Transferable Visual Models From Natural Language Supervision

openai/CLIP 26 Feb 2021

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.
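
As a usage illustration, the sketch below scores a handful of candidate captions against one image with the openai/CLIP repository. The image path and caption strings are placeholders; this is a minimal example rather than the paper's training procedure.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder query image and candidate captions
image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a dog on a beach",
                       "a plate of pasta",
                       "a city skyline at night"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Normalize so the dot product is cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

best = similarity.argmax().item()  # index of the highest-scoring caption
```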

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis ICML 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.

Sigmoid Loss for Language Image Pre-Training

google-research/big_vision ICCV 2023

We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP).
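
The pairwise loss can be sketched in a few lines: every image-text pair in the batch contributes an independent binary term, with matching pairs labeled +1 and all other pairs -1. This is a simplified PyTorch rendering of the loss described in the paper; in SigLIP the temperature and bias are learnable parameters, which are passed in as plain values here.

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_emb, text_emb, temperature, bias):
    """Pairwise sigmoid loss over a batch of N L2-normalized image/text embeddings (N, D)."""
    logits = image_emb @ text_emb.T * temperature + bias          # (N, N) pairwise scores
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0  # +1 on the diagonal, -1 elsewhere
    # Each pair contributes an independent log-sigmoid term; no batch-wide softmax is needed
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```

Because the loss decomposes over pairs, it avoids the global normalization of the softmax contrastive loss, which is what makes it attractive at very large batch sizes.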

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

salesforce/lavis NeurIPS 2021

Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.

Deep Visual-Semantic Alignments for Generating Image Descriptions

VinitSR7/Image-Caption-Generation CVPR 2015

Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

microsoft/Oscar ECCV 2020

Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks.

FLAVA: A Foundational Language And Vision Alignment Model

facebookresearch/multimodal CVPR 2022

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.

IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

e-bug/volta 27 Jan 2022

Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

Exploring Models and Data for Remote Sensing Image Caption Generation

201528014227051/RSICD_optimal 21 Dec 2017

Finally, a comprehensive review is presented on the proposed data set to fully advance the task of remote sensing image captioning.

WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

BAAI-WuDao/BriVl 11 Mar 2021

We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model.