Image-to-Text Retrieval
28 papers with code • 8 benchmarks • 8 datasets
Image-text retrieval refers to finding images relevant to a textual query, or retrieving textual descriptions relevant to a given image. It is an interdisciplinary area that blends techniques from computer vision, natural language processing (NLP), and machine learning. The aim is to bridge the semantic gap between the visual information in images and the textual descriptions humans use to interpret them.
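In practice, this two-way retrieval is commonly done with a dual-encoder model such as CLIP, which embeds images and texts into a shared space where cosine similarity ranks candidates in either direction. Below is a minimal sketch using the Hugging Face transformers CLIP API; the checkpoint name, image files, and captions are illustrative placeholders, not tied to any specific paper on this page.

```python
# Minimal cross-modal retrieval sketch with a CLIP dual encoder.
# Any CLIP-style model that embeds images and texts into a shared
# space works the same way; file names here are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["cat.jpg", "beach.jpg"]]  # placeholder images
texts = ["a cat sleeping on a sofa", "a sunny beach with palm trees"]

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity logits between every image and every text.
sim = outputs.logits_per_image  # shape: (num_images, num_texts)

# Image -> text retrieval: best caption for each image.
best_text = sim.argmax(dim=1)
# Text -> image retrieval: best image for each caption.
best_image = sim.argmax(dim=0)
```

The same similarity matrix serves both retrieval directions, which is why benchmarks typically report image-to-text and text-to-image recall together.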
Latest papers
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with that of LLMs.
Negative Pre-aware for Noisy Cross-modal Matching
Since clean samples are more easily distinguished by the GMM as noise increases, the memory bank can still maintain high quality at a high noise ratio.
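For context on the GMM step: a common recipe in noisy-correspondence work (shown here as an illustrative assumption, not necessarily this paper's exact procedure) is to fit a two-component Gaussian Mixture Model to per-sample matching losses and treat the low-loss component as the clean set.

```python
# Illustrative sketch (assumed recipe, not the paper's exact method):
# fit a two-component GMM to per-sample losses; the component with the
# lower mean loss is taken to contain the clean samples.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_clean(losses: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean mask over samples deemed clean."""
    losses = losses.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)
    clean_component = gmm.means_.argmin()  # lower-mean-loss component
    prob_clean = gmm.predict_proba(losses)[:, clean_component]
    return prob_clean > threshold

# Synthetic example: clean samples cluster at low loss, noisy ones high.
rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.2, 0.05, 800), rng.normal(1.0, 0.2, 200)])
mask = select_clean(losses)
print(mask.sum(), "of", len(losses), "samples kept as clean")
```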
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from the inherent data ambiguity.
Vision-Language Dataset Distillation
In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching.
PRIOR: Prototype Representation Joint Learning from Medical Images and Reports
In this paper, we present a prototype representation learning framework incorporating both global and local alignment between medical images and reports.
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
Moreover, we present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions.
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
Although extensively studied for unimodal models, the acceleration for multimodal models, especially the vision-language Transformers, is relatively under-explored.
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way to build a general representation model toward unlimited modalities.
Rethinking Benchmarks for Cross-modal Image-text Retrieval
The reason is that a large number of images and texts in the benchmarks are coarse-grained.
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers
Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities.