Image-to-Text Retrieval

28 papers with code • 8 benchmarks • 8 datasets

Image-text retrieval refers to finding images relevant to a given textual description, or conversely, retrieving textual descriptions relevant to a given image. It is an interdisciplinary area that blends techniques from computer vision, natural language processing (NLP), and machine learning. The aim is to bridge the semantic gap between the visual information present in images and the textual descriptions that humans use to interpret them.
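
As a rough illustration of both retrieval directions, below is a minimal sketch using a public CLIP checkpoint via the Hugging Face transformers library; the image path and toy captions are placeholders, not taken from any paper listed here.

    # Minimal sketch of bidirectional image-text retrieval with CLIP
    # (Hugging Face transformers; checkpoint name is one public example).
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    captions = ["a dog playing in the snow",
                "a plate of pasta",
                "a city skyline at night"]
    image = Image.open("query.jpg")  # placeholder local image

    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (num_images, num_texts); higher = more similar.
    sims = outputs.logits_per_image[0]
    print("best caption:", captions[sims.argmax().item()])
    # Text-to-image retrieval works symmetrically via outputs.logits_per_text.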

Latest papers with no code

CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?

no code yet • 7 Mar 2024

Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g., applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87%, and ImageNet 0-shot classification from 77% to 77.5%.
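
The "retrieval @5" figure above is a recall@k metric. Below is a sketch of how such a metric is typically computed from a similarity matrix, assuming for simplicity one ground-truth caption per image (COCO actually has several):

    import numpy as np

    def recall_at_k(sim: np.ndarray, k: int = 5) -> float:
        """Image-to-text recall@k from a (num_images, num_texts) similarity
        matrix, assuming text j is the ground-truth match for image j."""
        top_k = np.argsort(-sim, axis=1)[:, :k]  # k most similar texts per image
        hits = (top_k == np.arange(sim.shape[0])[:, None]).any(axis=1)
        return float(hits.mean())

    # Toy usage with random embeddings (illustrative only).
    rng = np.random.default_rng(0)
    img, txt = rng.normal(size=(100, 64)), rng.normal(size=(100, 64))
    print(recall_at_k(img @ txt.T, k=5))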

Towards a Visual-Language Foundation Model for Computational Pathology

no code yet • 24 Jul 2023

The accelerated adoption of digital pathology and advances in deep learning have enabled the development of powerful models for various pathology tasks across a diverse array of diseases and patient cohorts.

Is Cross-modal Information Retrieval Possible without Training?

no code yet • 20 Apr 2023

Embeddings for a particular modality of data occupy a high-dimensional space of their own, but they can be semantically aligned to another modality's space by a simple mapping, without training a deep neural network.
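
One classical, training-free way to realize such a mapping is orthogonal Procrustes alignment: given paired embeddings from two modalities, a closed-form SVD yields the best orthogonal map between the spaces. A sketch under the assumption that paired anchor embeddings are available (a generic technique, not necessarily this paper's exact method):

    import numpy as np

    def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
        """Closed-form orthogonal map W minimizing ||X @ W - Y||_F,
        given row-paired embeddings X (modality A) and Y (modality B)."""
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    # Toy usage: Y is a rotated, noisy copy of X (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 32))
    R, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # ground-truth rotation
    Y = X @ R + 0.01 * rng.normal(size=X.shape)
    W = procrustes_align(X, Y)
    print(np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))  # small residual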

Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

no code yet • ICCV 2023

We introduce WHOOPS!, a new dataset and benchmark for visual commonsense.

When are Lemons Purple? The Concept Association Bias of Vision-Language Models

no code yet • 22 Dec 2022

Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval.

A survey on knowledge-enhanced multimodal learning

no code yet • 19 Nov 2022

Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.

Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval

no code yet • 29 Jul 2022

For online paired data augmentation, we first generate augmented text through random token replacement, then pass the augmented text into the latent space alignment module to obtain latent codes, which are finally fed to StyleGAN2 to generate the augmented images.
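
Schematically, that pipeline looks like the sketch below; token_replace, alignment_module, and stylegan2 are hypothetical stand-ins for the paper's components, not its actual code.

    import random

    def token_replace(tokens, vocab, p=0.15):
        """Randomly replace tokens to produce an augmented caption
        (replacement rate is illustrative)."""
        return [random.choice(vocab) if random.random() < p else t
                for t in tokens]

    def augment_pair(tokens, vocab, alignment_module, stylegan2):
        # 1) Augment the text by random token replacement.
        aug_tokens = token_replace(tokens, vocab)
        # 2) Map the augmented text to latent codes
        #    (alignment_module stands in for the paper's module).
        latent = alignment_module(aug_tokens)
        # 3) Decode the latent codes into an augmented image.
        aug_image = stylegan2(latent)
        return aug_tokens, aug_image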

COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

no code yet • CVPR 2022

Under a fair comparison setting, our COTS achieves the highest performance among all two-stream methods and comparable performance (but with 10,800x faster inference) w.r.t. the latest single-stream methods.
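
The inference speedup of two-stream models comes from encoding images and texts independently: gallery embeddings can be precomputed offline, so a query reduces to one encoder pass plus a matrix multiply. A sketch of that retrieval step (names and shapes illustrative):

    import numpy as np

    def retrieve(query_img_emb: np.ndarray, text_emb_bank: np.ndarray, k: int = 5):
        """Top-k captions for one image embedding, given a precomputed
        (num_texts, dim) bank of unit-normalized text embeddings."""
        sims = text_emb_bank @ query_img_emb  # (num_texts,) cosine similarities
        return np.argsort(-sims)[:k]

    # Toy usage with random unit-normalized embeddings (illustrative only).
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(10000, 256))
    bank /= np.linalg.norm(bank, axis=1, keepdims=True)
    q = rng.normal(size=256)
    q /= np.linalg.norm(q)
    print(retrieve(q, bank))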

Hierarchical Gumbel Attention Network for Text-based Person Search

no code yet • 10 Oct 2020

This hard selection strategy is able to fuse strongly relevant multi-modal features, alleviating the problem of matching redundancy.
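
Hard (discrete) selection that remains trainable is commonly implemented with straight-through Gumbel-Softmax, which PyTorch provides out of the box; a minimal sketch of the general mechanism (shapes illustrative, not the paper's code):

    import torch
    import torch.nn.functional as F

    # Straight-through Gumbel-Softmax: the forward pass makes a one-hot
    # (hard) pick per row, while the backward pass uses the soft
    # distribution so the selector stays differentiable.
    scores = torch.randn(4, 10, requires_grad=True)         # relevance logits
    select = F.gumbel_softmax(scores, tau=1.0, hard=True)   # one-hot per row
    features = torch.randn(4, 10, 256)                      # candidate features
    picked = (select.unsqueeze(-1) * features).sum(dim=1)   # (4, 256) selected
    picked.sum().backward()                                 # gradients reach scores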

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

no code yet • 16 Aug 2019

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner.