Cross-Modal Information Retrieval
8 papers with code • 0 benchmarks • 0 datasets
Cross-Modal Information Retrieval (CMIR) is the task of finding relevant items across different modalities. For example, given an image, retrieve a relevant text description, or vice versa. The main challenge in CMIR is known as the heterogeneity gap: because items from different modalities have different data types, the similarity between them cannot be measured directly. Therefore, the majority of CMIR methods published to date attempt to bridge this gap by learning a latent representation space in which the similarity between items from different modalities can be measured.
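The shared-latent-space idea above can be sketched in a few lines. In this minimal sketch, the feature dimensions, the projection matrices `W_img` and `W_txt`, and the random features are all illustrative stand-ins: a real model would learn the projections from paired data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted features (dimensions are illustrative).
image_feats = rng.normal(size=(4, 512))   # e.g. CNN image features
text_feats = rng.normal(size=(3, 300))    # e.g. averaged word embeddings

# Projections into a shared 128-d latent space; random stand-ins here
# for weights a real CMIR model would train on paired image-text data.
W_img = rng.normal(size=(512, 128))
W_txt = rng.normal(size=(300, 128))

def embed(feats, W):
    """Project features into the shared space and L2-normalize."""
    z = feats @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

img_z = embed(image_feats, W_img)
txt_z = embed(text_feats, W_txt)

# Cosine similarity between every image and every text: the items are
# now directly comparable despite the original heterogeneity gap.
sim = img_z @ txt_z.T            # shape (4, 3)

# Text-to-image retrieval: rank all images for the first text query.
ranking = np.argsort(-sim[:, 0])
```

Once both modalities live in the same normalized space, retrieval in either direction reduces to a nearest-neighbor search over cosine similarities.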
Source: Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
Most implemented papers
Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions
We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation.
CMIR-NET: A Deep Learning Based Model For Cross-Modal Retrieval In Remote Sensing
In particular, we are interested in two application scenarios: i) cross-modal retrieval between panchromatic (PAN) and multi-spectral imagery, and ii) multi-label image retrieval between very high resolution (VHR) images and speech-based label annotations.
Cross-modal representation alignment of molecular structure and perturbation-induced transcriptional profiles
Modeling the relationship between chemical structure and molecular activity is a key goal in drug development.
ZSCRGAN: A GAN-based Expectation Maximization Model for Zero-Shot Retrieval of Images from Textual Descriptions
Most existing algorithms for cross-modal information retrieval are based on a supervised train-test setup, where a model learns to align the mode of the query (e.g., text) to the mode of the documents (e.g., images) from a given training set.
Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders
In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level.
Learning the Best Pooling Strategy for Visual Semantic Embedding
Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions.
VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words
To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that can achieve real-time searching for large-scale datasets, with significant accuracy improvement compared to previous state-of-the-art methods.
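The reason a weighted bag-of-words model can search at this scale is that per-image term weights can be stored in a classic inverted index, so a query is scored by lookups rather than by running a neural network per image. A toy sketch of that retrieval step (the image IDs and term weights below are made up; in VisualSparta such weights come from a transformer, which is not reproduced here):

```python
from collections import defaultdict

# Hypothetical precomputed term weights per image: how strongly each
# image "answers" a query token (illustrative numbers only).
image_term_weights = {
    "img_cat": {"cat": 2.1, "sofa": 0.8},
    "img_dog": {"dog": 1.9, "park": 1.1},
    "img_catdog": {"cat": 1.2, "dog": 1.3},
}

# Build an inverted index: term -> [(image_id, weight)].
index = defaultdict(list)
for img, weights in image_term_weights.items():
    for term, w in weights.items():
        index[term].append((img, w))

def search(query_tokens):
    """Score each image as the sum of its weights for the query tokens."""
    scores = defaultdict(float)
    for tok in query_tokens:
        for img, w in index.get(tok, []):
            scores[img] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])

results = search(["cat", "dog"])   # "img_catdog" matches both tokens
```

All the expensive modeling happens offline when the weights are computed; query time is just sparse lookups and additions, which is what makes real-time large-scale search feasible.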
Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval
Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions, and embed them into the same latent space for cross-modal information retrieval.
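VSE models of this kind are typically trained with a hinge-based triplet loss over in-batch negatives. A minimal sketch of the hardest-negative variant (the VSE++-style max-of-hinges loss; the paper above refines how negatives are selected, which is not reproduced here):

```python
import numpy as np

def triplet_loss_hardest(sim, margin=0.2):
    """Triplet loss using the hardest in-batch negative per query.

    sim: (n, n) image-text similarity matrix where sim[i, i] is the
    score of the i-th matched (positive) image-text pair.
    """
    n = sim.shape[0]
    pos = np.diag(sim)
    off = sim - np.eye(n) * 1e9          # mask out the positive pairs

    # Hardest negative caption for each image, and vice versa.
    hard_txt = off.max(axis=1)
    hard_img = off.max(axis=0)

    loss_i2t = np.maximum(0.0, margin + hard_txt - pos)
    loss_t2i = np.maximum(0.0, margin + hard_img - pos)
    return (loss_i2t + loss_t2i).mean()

# Well-separated batch: positives far above negatives -> zero loss.
sim = np.full((3, 3), 0.1) + np.eye(3) * 0.9
loss = triplet_loss_hardest(sim)
```

Focusing the hinge on the hardest negative, rather than summing over all of them, is what makes the choice of negatives matter, which is the lever the paper's semantically-enhanced negatives act on.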