Image-text matching
100 papers with code • 1 benchmark • 1 dataset
Image-Text Matching is a subtask within Cross-Modal Retrieval (CMR) that involves establishing associations between images and corresponding textual descriptions. The goal is to retrieve an image given a textual query or, conversely, retrieve a textual description given an image query. This task is challenging due to the heterogeneity gap between image and text data representations. Image-text matching is used in applications such as content-based image search, visual question answering, and multimodal summarization.
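The usual recipe for closing this gap is to embed both modalities into a shared space and rank candidates by similarity. Below is a minimal sketch of that retrieval step; the encoder is left unspecified and random vectors stand in for real image and caption features.

# Minimal sketch of embedding-based image-text retrieval.
# Assumes images and captions have already been encoded into a shared
# d-dimensional space by some vision-language model; random vectors
# stand in for real features here.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

rng = np.random.default_rng(0)
image_embs = l2_normalize(rng.normal(size=(1000, 512)))  # 1000 images, d=512
text_emb = l2_normalize(rng.normal(size=(512,)))         # one text query

# Cosine similarity reduces to a dot product after L2 normalization.
scores = image_embs @ text_emb
top5 = np.argsort(-scores)[:5]   # indices of the 5 best-matching images
print(top5, scores[top5])

Image-to-text retrieval works the same way with the roles of the two embedding sets swapped.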
Libraries
Use these libraries to find Image-text matching models and implementations.

Most implemented papers
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Performance improvement has largely been achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text).
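A minimal sketch of what conditional masking means in practice follows; the token ids, masking rate, and data layout are placeholders for illustration, not UNITER's actual implementation.

# Sketch of the conditional-masking idea: in a given training step, mask
# tokens in only one modality while the other modality is fully observed.
# Token ids, the mask probability, and the [MASK] id are illustrative.
import random

MASK_ID = 103          # placeholder [MASK] token id
MASK_PROB = 0.15

def conditional_mask(text_ids, region_feats, mask_text=True):
    """Mask text conditioned on the full image, or regions conditioned on the full text."""
    text_ids = list(text_ids)
    region_feats = [list(r) for r in region_feats]
    if mask_text:
        # masked language modeling: corrupt words, keep every region intact
        for i in range(len(text_ids)):
            if random.random() < MASK_PROB:
                text_ids[i] = MASK_ID
    else:
        # masked region modeling: zero out regions, keep every word intact
        for i in range(len(region_feats)):
            if random.random() < MASK_PROB:
                region_feats[i] = [0.0] * len(region_feats[i])
    return text_ids, region_feats

# Example call with toy inputs (ids and 4-dimensional region features are made up).
ids, regs = conditional_mask([101, 2023, 2003, 1037, 4937, 102], [[0.1] * 4 for _ in range(3)])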
VinVL: Revisiting Visual Representations in Vision-Language Models
In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, Oscar (Li et al., 2020), and utilize an improved approach to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
Stacked Cross Attention for Image-Text Matching
Prior work either simply aggregates the similarity of all possible pairs of regions and words without attending differentially to more and less important words or regions, or uses a multi-step attentional process to capture a limited number of semantic alignments, which is less interpretable.
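A simplified sketch of this kind of text-to-image cross attention is shown below: each word attends over image regions and the image-sentence score aggregates per-word relevance. The shapes, temperature, and averaging are illustrative, not the paper's exact formulation.

# Sketch of text-to-image stacked cross attention.
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def image_text_score(regions, words, lambda_=9.0):
    regions, words = l2norm(regions), l2norm(words)       # (k, d), (n, d)
    sim = words @ regions.T                               # word-region cosine, (n, k)
    sim = np.clip(sim, 0, None)                           # keep positive evidence only
    attn = np.exp(lambda_ * sim)
    attn /= attn.sum(axis=1, keepdims=True)               # attention over regions per word
    attended = attn @ regions                             # attended image vector per word, (n, d)
    relevance = np.sum(l2norm(attended) * words, axis=1)  # per-word cosine relevance
    return relevance.mean()                               # simple average aggregation

rng = np.random.default_rng(0)
print(image_text_score(rng.normal(size=(36, 256)), rng.normal(size=(12, 256))))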
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.
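The "align before fuse" idea pairs such a fused encoder with a contrastive alignment objective on the unimodal embeddings. Below is a minimal sketch of that image-text contrastive loss over a batch; the temperature and toy batch are illustrative, and this omits the paper's momentum distillation.

# Sketch of an image-text contrastive (alignment) loss: matched pairs sit on
# the diagonal of the similarity matrix and are pulled together, mismatched
# pairs are pushed apart.
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                 # (batch, batch)
    labels = np.arange(len(img))                       # i-th image matches i-th caption
    loss_i2t = -log_softmax(logits, axis=1)[labels, labels].mean()
    loss_t2i = -log_softmax(logits, axis=0)[labels, labels].mean()
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 512)), rng.normal(size=(8, 512))))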
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
In this paper, we present an end-to-end framework Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations.
Dual Attention Networks for Multimodal Reasoning and Matching
We propose Dual Attention Networks (DANs) which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language.