Cross-Modal Retrieval
237 papers with code • 13 benchmarks • 24 datasets
Cross-Modal Retrieval (CMR) is a task of retrieving items across different modalities, such as image, text, video, and audio. The core challenge of CMR is the heterogeneity gap, which arises because data from different modalities have distinct representations, making direct comparison difficult. To address this, most CMR methods focus on learning a shared latent embedding space. In this space, concepts from different modalities are projected, allowing their similarity to be measured using a distance metric.
Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
Libraries
Use these libraries to find Cross-Modal Retrieval models and implementationsDatasets
Subtasks
Most implemented papers
Stacked Capsule Autoencoders
In the second stage, SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses.
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval.
Rescaling Egocentric Vision
This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS.
Stacked Cross Attention for Image-Text Matching
Prior work either simply aggregates the similarity of all possible pairs of regions and words without attending differentially to more and less important words or regions, or uses a multi-step attentional process to capture limited number of semantic alignments which is less interpretable.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset.
Deep Visual-Semantic Alignments for Generating Image Descriptions
Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks.