Image-to-Text Retrieval
28 papers with code • 8 benchmarks • 8 datasets
Image-text retrieval refers to finding images relevant to a textual query, or retrieving textual descriptions relevant to a given image. It is an interdisciplinary area that blends techniques from computer vision, natural language processing (NLP), and machine learning. The aim is to bridge the semantic gap between the visual information in images and the textual descriptions humans use to interpret them.
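In practice, this two-way retrieval is commonly done with a dual-encoder model such as CLIP, which embeds images and texts into a shared space where cosine similarity ranks candidates in either direction. Below is a minimal sketch using the Hugging Face transformers CLIP API; the checkpoint name, image files, and captions are illustrative placeholders, not tied to any specific paper on this page.

```python
# Minimal cross-modal retrieval sketch with a CLIP dual encoder.
# Any CLIP-style model that embeds images and texts into a shared
# space works the same way; file names here are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["cat.jpg", "beach.jpg"]]  # placeholder images
texts = ["a cat sleeping on a sofa", "a sunny beach with palm trees"]

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity logits between every image and every text.
sim = outputs.logits_per_image  # shape: (num_images, num_texts)

# Image -> text retrieval: best caption for each image.
best_text = sim.argmax(dim=1)
# Text -> image retrieval: best image for each caption.
best_image = sim.argmax(dim=0)
```

The same similarity matrix serves both retrieval directions, which is why benchmarks typically report image-to-text and text-to-image recall together.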
Latest papers
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with that of LLMs.
Negative Pre-aware for Noisy Cross-modal Matching
Since clean samples are more easily distinguished by the GMM as noise increases, the memory bank can still maintain high quality at a high noise ratio.
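For context on the GMM step: a common recipe in noisy-correspondence work (shown here as an illustrative assumption, not necessarily this paper's exact procedure) is to fit a two-component Gaussian Mixture Model to per-sample matching losses and treat the low-loss component as the clean set.

```python
# Illustrative sketch (assumed recipe, not the paper's exact method):
# fit a two-component GMM to per-sample losses; the component with the
# lower mean loss is taken to contain the clean samples.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_clean(losses: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean mask over samples deemed clean."""
    losses = losses.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)
    clean_component = gmm.means_.argmin()  # lower-mean-loss component
    prob_clean = gmm.predict_proba(losses)[:, clean_component]
    return prob_clean > threshold

# Synthetic example: clean samples cluster at low loss, noisy ones high.
rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.2, 0.05, 800), rng.normal(1.0, 0.2, 200)])
mask = select_clean(losses)
print(mask.sum(), "of", len(losses), "samples kept as clean")
```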
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from the inherent data ambiguity.
Vision-Language Dataset Distillation
In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching.
PRIOR: Prototype Representation Joint Learning from Medical Images and Reports
In this paper, we present a prototype representation learning framework incorporating both global and local alignment between medical images and reports.
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
Moreover, we present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions.
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
Although extensively studied for unimodal models, the acceleration for multimodal models, especially the vision-language Transformers, is relatively under-explored.
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way to build a general representation model toward unlimited modalities.
Rethinking Benchmarks for Cross-modal Image-text Retrieval
The reason is that a large number of images and texts in the benchmarks are coarse-grained.
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers
Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities.