Image-to-Text Retrieval

28 papers with code • 8 benchmarks • 8 datasets

Image-text retrieval refers to the process of finding relevant images based on textual descriptions or retrieving textual descriptions that are relevant to a given image. It's an interdisciplinary area that blends techniques from computer vision, natural language processing (NLP), and machine learning. The aim is to bridge the semantic gap between the visual information present in images and the textual descriptions that humans use to interpret them.

Benchmarks

Add a Result

These leaderboards are used to track progress in Image-to-Text Retrieval

Dataset	Best Model	Compare
Flickr30k	InternVL-G-FT (finetuned, w/o ranking)	See all
MS COCO	BLIP-2 ViT-G (fine-tuned)	See all
WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images	BLIP2 FlanT5-XXL (Text-only FT)	See all
AIC-ICC	ERNIE-ViL2.0	See all
RUC-CAS-WenLan	CMCL	See all
Localized Narratives	OPT	See all
FETA Car-Manuals	FETA's CLIP-MIL (Many-Shot Image-to-text)	See all
RSICD	GeoRSCLIP-FT	See all

Libraries

Use these libraries to find Image-to-Text Retrieval models and implementations

facebookresearch/multimodal

3 papers

1,311

huggingface/transformers

2 papers

125,940

salesforce/lavis

2 papers

8,848

Datasets

Most implemented papers

Most implemented Social Latest No code

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis • • 30 Jan 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.

Paper
Code

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

salesforce/lavis • • NeurIPS 2021

Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.

Paper
Code

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

microsoft/Oscar • • ECCV 2020

Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks.

Paper
Code

Deep Visual-Semantic Alignments for Generating Image Descriptions

VinitSR7/Image-Caption-Generation • • CVPR 2015

Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.

Paper
Code

FLAVA: A Foundational Language And Vision Alignment Model

facebookresearch/multimodal • • CVPR 2022

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.

Paper
Code

IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

e-bug/volta • • 27 Jan 2022

Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

Paper
Code

Exploring Models and Data for Remote Sensing Image Caption Generation

201528014227051/RSICD_optimal • 21 Dec 2017

Finally, a comprehensive review is presented on the proposed data set to fully advance the task of remote sensing caption.

Paper
Code

WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

BAAI-WuDao/BriVl • • 11 Mar 2021

We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model.

Paper
Code

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

mindspore-ai/models • • 1 Jul 2021

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.

Paper
Code

A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval

VL-Group/2022-NeurIPS-DAA • • NeurIPS 2022 2022

To verify the effectiveness of our approach, extensive experiments are conducted on MS-COCO, CUB Captions, and Flickr30K, which are commonly used in cross-modal retrieval.

Paper
Code

Image-to-Text Retrieval

Benchmarks Add a Result

Libraries

Datasets

Most implemented papers

Content

Benchmarks

Add a Result