Cross-Modal Information Retrieval
8 papers with code • 0 benchmarks • 0 datasets
Cross-Modal Information Retrieval (CMIR) is the task of finding relevant items across different modalities. For example, given an image, retrieve a relevant text description, or vice versa. The main challenge in CMIR is known as the heterogeneity gap: because items from different modalities have different data types, the similarity between them cannot be measured directly. Therefore, the majority of CMIR methods published to date attempt to bridge this gap by learning a latent representation space in which the similarity between items from different modalities can be measured.
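The shared-latent-space idea above can be sketched in a few lines. In this minimal sketch, the feature dimensions, the projection matrices `W_img` and `W_txt`, and the random features are all illustrative stand-ins: a real model would learn the projections from paired data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted features (dimensions are illustrative).
image_feats = rng.normal(size=(4, 512))   # e.g. CNN image features
text_feats = rng.normal(size=(3, 300))    # e.g. averaged word embeddings

# Projections into a shared 128-d latent space; random stand-ins here
# for weights a real CMIR model would train on paired image-text data.
W_img = rng.normal(size=(512, 128))
W_txt = rng.normal(size=(300, 128))

def embed(feats, W):
    """Project features into the shared space and L2-normalize."""
    z = feats @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

img_z = embed(image_feats, W_img)
txt_z = embed(text_feats, W_txt)

# Cosine similarity between every image and every text: the items are
# now directly comparable despite the original heterogeneity gap.
sim = img_z @ txt_z.T            # shape (4, 3)

# Text-to-image retrieval: rank all images for the first text query.
ranking = np.argsort(-sim[:, 0])
```

Once both modalities live in the same normalized space, retrieval in either direction reduces to a nearest-neighbor search over cosine similarities.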
Source: Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
Most implemented papers
Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions
We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation.
CMIR-NET: A Deep Learning Based Model For Cross-Modal Retrieval In Remote Sensing
In particular, we are interested in two application scenarios: i) cross-modal retrieval between panchromatic (PAN) and multi-spectral imagery, and ii) multi-label image retrieval between very high resolution (VHR) images and speech-based label annotations.
Cross-modal representation alignment of molecular structure and perturbation-induced transcriptional profiles
Modeling the relationship between chemical structure and molecular activity is a key goal in drug development.
ZSCRGAN: A GAN-based Expectation Maximization Model for Zero-Shot Retrieval of Images from Textual Descriptions
Most existing algorithms for cross-modal information retrieval are based on a supervised train-test setup, where a model learns to align the mode of the query (e.g., text) to the mode of the documents (e.g., images) from a given training set.
Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders
In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level.
Learning the Best Pooling Strategy for Visual Semantic Embedding
Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions.
VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words
To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that can achieve real-time searching for large-scale datasets, with significant accuracy improvement compared to previous state-of-the-art methods.
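The reason a weighted bag-of-words model can search at this scale is that per-image term weights can be stored in a classic inverted index, so a query is scored by lookups rather than by running a neural network per image. A toy sketch of that retrieval step (the image IDs and term weights below are made up; in VisualSparta such weights come from a transformer, which is not reproduced here):

```python
from collections import defaultdict

# Hypothetical precomputed term weights per image: how strongly each
# image "answers" a query token (illustrative numbers only).
image_term_weights = {
    "img_cat": {"cat": 2.1, "sofa": 0.8},
    "img_dog": {"dog": 1.9, "park": 1.1},
    "img_catdog": {"cat": 1.2, "dog": 1.3},
}

# Build an inverted index: term -> [(image_id, weight)].
index = defaultdict(list)
for img, weights in image_term_weights.items():
    for term, w in weights.items():
        index[term].append((img, w))

def search(query_tokens):
    """Score each image as the sum of its weights for the query tokens."""
    scores = defaultdict(float)
    for tok in query_tokens:
        for img, w in index.get(tok, []):
            scores[img] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])

results = search(["cat", "dog"])   # "img_catdog" matches both tokens
```

All the expensive modeling happens offline when the weights are computed; query time is just sparse lookups and additions, which is what makes real-time large-scale search feasible.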
Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval
Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions, and embed them into the same latent space for cross-modal information retrieval.
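VSE models of this kind are typically trained with a hinge-based triplet loss over in-batch negatives. A minimal sketch of the hardest-negative variant (the VSE++-style max-of-hinges loss; the paper above refines how negatives are selected, which is not reproduced here):

```python
import numpy as np

def triplet_loss_hardest(sim, margin=0.2):
    """Triplet loss using the hardest in-batch negative per query.

    sim: (n, n) image-text similarity matrix where sim[i, i] is the
    score of the i-th matched (positive) image-text pair.
    """
    n = sim.shape[0]
    pos = np.diag(sim)
    off = sim - np.eye(n) * 1e9          # mask out the positive pairs

    # Hardest negative caption for each image, and vice versa.
    hard_txt = off.max(axis=1)
    hard_img = off.max(axis=0)

    loss_i2t = np.maximum(0.0, margin + hard_txt - pos)
    loss_t2i = np.maximum(0.0, margin + hard_img - pos)
    return (loss_i2t + loss_t2i).mean()

# Well-separated batch: positives far above negatives -> zero loss.
sim = np.full((3, 3), 0.1) + np.eye(3) * 0.9
loss = triplet_loss_hardest(sim)
```

Focusing the hinge on the hardest negative, rather than summing over all of them, is what makes the choice of negatives matter, which is the lever the paper's semantically-enhanced negatives act on.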