Cross-Modal Information Retrieval

8 papers with code • 0 benchmarks • 0 datasets

Cross-Modal Information Retrieval (CMIR) is the task of finding relevant items across different modalities. For example, given an image, find relevant text, or vice versa. The main challenge in CMIR is known as the heterogeneity gap: because items from different modalities have different data types, the similarity between them cannot be measured directly. Therefore, the majority of CMIR methods published to date attempt to bridge this gap by learning a latent representation space in which the similarity between items from different modalities can be measured.

Source: Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
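To make the shared-latent-space idea concrete, here is a minimal sketch (not taken from any of the papers listed below; class names, layer sizes, and feature dimensions are assumptions): two small projection heads map pre-extracted image and text features into a common space, where cosine similarity acts as the cross-modal relevance score.

```python
# Illustrative sketch only: two projection heads map pre-extracted image and
# text features into a shared latent space; cosine similarity is then a
# directly comparable cross-modal score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, latent_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, latent_dim)   # image branch
        self.txt_proj = nn.Linear(txt_dim, latent_dim)   # text branch

    def forward(self, img_feats, txt_feats):
        # L2-normalise so that a dot product equals cosine similarity
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img, txt

model = SharedSpaceProjector()
img_feats = torch.randn(4, 2048)   # e.g. CNN features for 4 images
txt_feats = torch.randn(6, 768)    # e.g. sentence embeddings for 6 captions
img, txt = model(img_feats, txt_feats)
similarity = img @ txt.t()         # 4 x 6 cross-modal similarity matrix
print(similarity.shape)
```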

Most implemented papers

Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions

AlexMoreo/tensorflow-Tex2Vis 23 Jun 2016

We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation.
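A rough sketch of the "translate the textual query into a visual representation, then search in visual feature space" idea follows. The layer sizes, variable names, and the use of cosine-based nearest-neighbour lookup are assumptions for illustration, not details of the Tex2Vis code.

```python
# Hypothetical sketch: a text embedding is regressed into the visual feature
# space, and retrieval is a nearest-neighbour search among indexed images.
import torch
import torch.nn as nn
import torch.nn.functional as F

text_to_visual = nn.Sequential(      # maps a text embedding to visual space
    nn.Linear(768, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2048),           # 2048-d, matching the image features
)

image_index = F.normalize(torch.randn(1000, 2048), dim=-1)  # indexed images

query_embedding = torch.randn(1, 768)                   # encoded text query
predicted_visual = F.normalize(text_to_visual(query_embedding), dim=-1)
scores = predicted_visual @ image_index.t()             # cosine similarities
top10 = scores.topk(10, dim=-1).indices                 # nearest images
print(top10)
```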

CMIR-NET : A Deep Learning Based Model For Cross-Modal Retrieval In Remote Sensing

ushasi/CMIR-NET-A-deep-learning-based-model-for-cross-modal-retrieval-in-remote-sensing 9 Apr 2019

In particular, we are interested in two application scenarios: i) cross-modal retrieval between panchromatic (PAN) and multi-spectral imagery, and ii) multi-label image retrieval between very high resolution (VHR) images and speech based label annotations.

Cross-modal representation alignment of molecular structure and perturbation-induced transcriptional profiles

sgfin/molecule_ge_coordinated_embeddings 22 Nov 2019

Modeling the relationship between chemical structure and molecular activity is a key goal in drug development.

ZSCRGAN: A GAN-based Expectation Maximization Model for Zero-Shot Retrieval of Images from Textual Descriptions

ranarag/ZSCRGAN 23 Jul 2020

Most existing algorithms for cross-modal Information Retrieval are based on a supervised train-test setup, where a model learns to align the mode of the query (e.g., text) to the mode of the documents (e.g., images) from a given training set.

Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders

mesnico/TERAN 12 Aug 2020

In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level.
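Below is a minimal illustration of how fine-grained word-region similarities can be pooled into a single image-sentence score under only global supervision, using a common max-over-regions, mean-over-words scheme. This is a simplified sketch with made-up dimensions, not the exact TERAN formulation.

```python
# Sketch: pool a word-region alignment matrix into one global matching score
# (max over regions for each word, then mean over words).
import torch
import torch.nn.functional as F

regions = F.normalize(torch.randn(36, 256), dim=-1)   # 36 region embeddings
words = F.normalize(torch.randn(12, 256), dim=-1)     # 12 word embeddings

alignment = words @ regions.t()          # 12 x 36 word-region similarities
per_word = alignment.max(dim=1).values   # best-matching region for each word
global_score = per_word.mean()           # global image-sentence similarity
print(global_score)
```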

Learning the Best Pooling Strategy for Visual Semantic Embedding

woodfrog/vse_infty CVPR 2021

Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions.
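As a rough sketch of what "learning the pooling strategy" can mean, the snippet below pools a variable-size set of region features with learnable per-position weights over coordinate-wise sorted values, so the operator can interpolate between mean- and max-pooling. The weighting scheme shown is a simplification, not the paper's exact operator.

```python
# Sketch of a learnable pooling operator over a set of region features:
# each sorted position gets a learnable, softmax-normalised weight.
import torch
import torch.nn as nn

class LearnablePooling(nn.Module):
    def __init__(self, num_positions=36):
        super().__init__()
        self.position_weights = nn.Parameter(torch.zeros(num_positions))

    def forward(self, features):                       # features: (n, d)
        sorted_feats, _ = features.sort(dim=0, descending=True)
        weights = torch.softmax(self.position_weights[: features.size(0)], 0)
        return (weights.unsqueeze(1) * sorted_feats).sum(dim=0)   # (d,)

pool = LearnablePooling()
region_features = torch.randn(36, 256)
image_embedding = pool(region_features)   # between mean- and max-pooling
print(image_embedding.shape)
```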

VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words

soco-ai/SF-QA ACL 2021

To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that can achieve real-time searching for large-scale datasets, with significant accuracy improvement compared to previous state-of-the-art methods.
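The weighted bag-of-words framing can be illustrated with a toy sparse index: each image is pre-indexed as a map from vocabulary terms to weights, and a text query is scored by summing the weights of its tokens, which is what makes inverted-index-style real-time search possible. The index contents and scoring below are invented for illustration; how the term weights are learned is omitted.

```python
# Toy sketch of weighted bag-of-words text-to-image scoring over a sparse
# per-image term-weight index (values here are made up).
image_index = {
    "img_001": {"dog": 2.3, "grass": 1.1, "ball": 0.8},
    "img_002": {"cat": 2.7, "sofa": 1.4},
}

def score(query_tokens, term_weights):
    # sum the pre-computed weight of every query token present in the image
    return sum(term_weights.get(tok, 0.0) for tok in query_tokens)

query = ["dog", "playing", "ball"]
ranked = sorted(image_index, key=lambda k: score(query, image_index[k]),
                reverse=True)
print(ranked)  # ['img_001', 'img_002']
```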

Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval

yangong23/vsenetworkslseh 10 Oct 2022

Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions, and embed them into the same latent space for cross-modal information retrieval.
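For context, VSE models of this kind are typically trained with a hardest-negative triplet loss over in-batch image-text pairs; the sketch below shows that standard loss only. The paper's contribution, selecting semantically-enhanced hard negatives using the similarity of the descriptions, is not shown, and the function here is an assumption-laden illustration rather than the authors' implementation.

```python
# Sketch of the standard hardest-negative triplet loss used to train VSE
# models on a batch of matching image-text pairs.
import torch
import torch.nn.functional as F

def hardest_negative_triplet_loss(img, txt, margin=0.2):
    # img, txt: L2-normalised (batch, dim) embeddings of matching pairs
    scores = img @ txt.t()                      # (batch, batch) similarities
    positives = scores.diag().view(-1, 1)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_txt = (margin + scores - positives).clamp(min=0).masked_fill(mask, 0)
    cost_img = (margin + scores - positives.t()).clamp(min=0).masked_fill(mask, 0)
    # keep only the hardest (largest-violation) negative per anchor
    return cost_txt.max(dim=1).values.mean() + cost_img.max(dim=0).values.mean()

img = F.normalize(torch.randn(8, 256), dim=-1)
txt = F.normalize(torch.randn(8, 256), dim=-1)
print(hardest_negative_triplet_loss(img, txt))
```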