Image-to-Text Retrieval

28 papers with code • 8 benchmarks • 8 datasets

Image-text retrieval refers to finding images relevant to a given textual description, or conversely, retrieving textual descriptions relevant to a given image. It is an interdisciplinary area that blends techniques from computer vision, natural language processing (NLP), and machine learning. The aim is to bridge the semantic gap between the visual information present in images and the textual descriptions that humans use to interpret them.
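
As a rough illustration of both retrieval directions, below is a minimal sketch using a public CLIP checkpoint via the Hugging Face transformers library; the image path and toy captions are placeholders, not taken from any paper listed here.

    # Minimal sketch of bidirectional image-text retrieval with CLIP
    # (Hugging Face transformers; checkpoint name is one public example).
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    captions = ["a dog playing in the snow",
                "a plate of pasta",
                "a city skyline at night"]
    image = Image.open("query.jpg")  # placeholder local image

    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (num_images, num_texts); higher = more similar.
    sims = outputs.logits_per_image[0]
    print("best caption:", captions[sims.argmax().item()])
    # Text-to-image retrieval works symmetrically via outputs.logits_per_text.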

Latest papers with no code

CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?

no code yet • 7 Mar 2024

Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g., applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87%, and ImageNet 0-shot classification from 77% to 77.5%.
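
The "retrieval @5" figure above is a recall@k metric. Below is a sketch of how such a metric is typically computed from a similarity matrix, assuming for simplicity one ground-truth caption per image (COCO actually has several):

    import numpy as np

    def recall_at_k(sim: np.ndarray, k: int = 5) -> float:
        """Image-to-text recall@k from a (num_images, num_texts) similarity
        matrix, assuming text j is the ground-truth match for image j."""
        top_k = np.argsort(-sim, axis=1)[:, :k]  # k most similar texts per image
        hits = (top_k == np.arange(sim.shape[0])[:, None]).any(axis=1)
        return float(hits.mean())

    # Toy usage with random embeddings (illustrative only).
    rng = np.random.default_rng(0)
    img, txt = rng.normal(size=(100, 64)), rng.normal(size=(100, 64))
    print(recall_at_k(img @ txt.T, k=5))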

Towards a Visual-Language Foundation Model for Computational Pathology

no code yet • 24 Jul 2023

The accelerated adoption of digital pathology and advances in deep learning have enabled the development of powerful models for various pathology tasks across a diverse array of diseases and patient cohorts.

Is Cross-modal Information Retrieval Possible without Training?

no code yet • 20 Apr 2023

Embeddings for a particular modality of data occupy a high-dimensional space of their own, but they can be semantically aligned to another modality's space by a simple mapping, without training a deep neural network.
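
One classical, training-free way to realize such a mapping is orthogonal Procrustes alignment: given paired embeddings from two modalities, a closed-form SVD yields the best orthogonal map between the spaces. A sketch under the assumption that paired anchor embeddings are available (a generic technique, not necessarily this paper's exact method):

    import numpy as np

    def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
        """Closed-form orthogonal map W minimizing ||X @ W - Y||_F,
        given row-paired embeddings X (modality A) and Y (modality B)."""
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    # Toy usage: Y is a rotated, noisy copy of X (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 32))
    R, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # ground-truth rotation
    Y = X @ R + 0.01 * rng.normal(size=X.shape)
    W = procrustes_align(X, Y)
    print(np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))  # small residual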

Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

no code yet • ICCV 2023

We introduce WHOOPS!, a new dataset and benchmark for visual commonsense.

When are Lemons Purple? The Concept Association Bias of Vision-Language Models

no code yet • 22 Dec 2022

Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval.

A survey on knowledge-enhanced multimodal learning

no code yet • 19 Nov 2022

Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.

Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval

no code yet • 29 Jul 2022

For online paired data augmentation, we first generate augmented text through random token replacement, then pass the augmented text into the latent space alignment module to obtain latent codes, which are finally fed to StyleGAN2 to generate the augmented images.
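
Schematically, that pipeline looks like the sketch below; token_replace, alignment_module, and stylegan2 are hypothetical stand-ins for the paper's components, not its actual code.

    import random

    def token_replace(tokens, vocab, p=0.15):
        """Randomly replace tokens to produce an augmented caption
        (replacement rate is illustrative)."""
        return [random.choice(vocab) if random.random() < p else t
                for t in tokens]

    def augment_pair(tokens, vocab, alignment_module, stylegan2):
        # 1) Augment the text by random token replacement.
        aug_tokens = token_replace(tokens, vocab)
        # 2) Map the augmented text to latent codes
        #    (alignment_module stands in for the paper's module).
        latent = alignment_module(aug_tokens)
        # 3) Decode the latent codes into an augmented image.
        aug_image = stylegan2(latent)
        return aug_tokens, aug_image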

COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

no code yet • CVPR 2022

Under a fair comparison setting, our COTS achieves the highest performance among all two-stream methods and comparable performance (but with 10,800x faster inference) w.r.t. the latest single-stream methods.
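
The inference speedup of two-stream models comes from encoding images and texts independently: gallery embeddings can be precomputed offline, so a query reduces to one encoder pass plus a matrix multiply. A sketch of that retrieval step (names and shapes illustrative):

    import numpy as np

    def retrieve(query_img_emb: np.ndarray, text_emb_bank: np.ndarray, k: int = 5):
        """Top-k captions for one image embedding, given a precomputed
        (num_texts, dim) bank of unit-normalized text embeddings."""
        sims = text_emb_bank @ query_img_emb  # (num_texts,) cosine similarities
        return np.argsort(-sims)[:k]

    # Toy usage with random unit-normalized embeddings (illustrative only).
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(10000, 256))
    bank /= np.linalg.norm(bank, axis=1, keepdims=True)
    q = rng.normal(size=256)
    q /= np.linalg.norm(q)
    print(retrieve(q, bank))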

Hierarchical Gumbel Attention Network for Text-based Person Search

no code yet • 10 Oct 2020

This hard selection strategy is able to fuse strongly relevant multi-modal features, alleviating the problem of matching redundancy.
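
Hard (discrete) selection that remains trainable is commonly implemented with straight-through Gumbel-Softmax, which PyTorch provides out of the box; a minimal sketch of the general mechanism (shapes illustrative, not the paper's code):

    import torch
    import torch.nn.functional as F

    # Straight-through Gumbel-Softmax: the forward pass makes a one-hot
    # (hard) pick per row, while the backward pass uses the soft
    # distribution so the selector stays differentiable.
    scores = torch.randn(4, 10, requires_grad=True)         # relevance logits
    select = F.gumbel_softmax(scores, tau=1.0, hard=True)   # one-hot per row
    features = torch.randn(4, 10, 256)                      # candidate features
    picked = (select.unsqueeze(-1) * features).sum(dim=1)   # (4, 256) selected
    picked.sum().backward()                                 # gradients reach scores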

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

no code yet • 16 Aug 2019

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner.