Image-to-Text Retrieval
28 papers with code • 8 benchmarks • 8 datasets
Most implemented papers
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.
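The retrieval side of these models typically scores each image-caption pair with separate image and text encoders before any cross-modal fusion. As an illustrative sketch only (not the paper's own code), the same dual-encoder scoring can be run with an openly available CLIP checkpoint in HuggingFace transformers; the checkpoint name, caption list, and image path below are assumptions for illustration.

```python
# Minimal dual-encoder image-to-text retrieval sketch.
# Assumption: a CLIP checkpoint is used as a stand-in for the contrastive
# (align-before-fuse) stage; this is not the ALBEF reference implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = [
    "a dog catching a frisbee in a park",
    "two people riding bicycles at sunset",
    "a plate of pasta on a wooden table",
]
image = Image.open("query.jpg")  # hypothetical query image

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image and each caption.
scores = outputs.logits_per_image.squeeze(0)
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. {captions[idx]} (score={scores[idx]:.2f})")
```

The highest-scoring captions are returned as the retrieval result; fusion-based rerankers (as in ALBEF or BLIP-2) are typically applied only to this short list of candidates.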
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods that learn cross-modal representations on image-text pairs are becoming popular for vision-language tasks.
Deep Visual-Semantic Alignments for Generating Image Descriptions
Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.
FLAVA: A Foundational Language And Vision Alignment Model
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages
Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.
Exploring Models and Data for Remote Sensing Image Caption Generation
Finally, a comprehensive review of the proposed dataset is presented to fully advance the task of remote sensing image captioning.
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model.
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.
A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval
To verify the effectiveness of our approach, extensive experiments are conducted on MS-COCO, CUB Captions, and Flickr30K, which are commonly used in cross-modal retrieval.
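Results on these benchmarks are usually reported as Recall@K: the fraction of query images for which a ground-truth caption appears among the top K retrieved texts. Below is a minimal sketch of that metric, assuming a precomputed image-caption similarity matrix; the function name and the toy data are made up for illustration (on MS-COCO and Flickr30K each image has five reference captions, and a hit is counted if any of them lands in the top K).

```python
import numpy as np

def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Image-to-text Recall@K.

    sim: (num_images, num_captions) similarity matrix, higher = more similar.
    gt:  list of sets; gt[i] holds the caption indices that describe image i.
    """
    # Top candidates per image, best first; only the largest K is needed.
    topk = np.argsort(-sim, axis=1)[:, :max(ks)]
    results = {}
    for k in ks:
        hits = sum(
            1 for i, row in enumerate(topk) if gt[i] & set(row[:k].tolist())
        )
        results[f"R@{k}"] = 100.0 * hits / sim.shape[0]
    return results

# Toy usage with random scores: 4 images, each paired with 5 captions.
rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 20))
gt = [set(range(i * 5, (i + 1) * 5)) for i in range(4)]
print(recall_at_k(sim, gt))
```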