Cross-Modal Retrieval

190 papers with code • 12 benchmarks • 20 datasets

Cross-Modal Retrieval is the task of retrieving relevant items across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge is the modality gap: features from different modalities are not directly comparable. The key solution is to learn new representations for the different modalities in a shared subspace, so that the resulting features can be compared with standard distance metrics such as cosine distance or Euclidean distance.
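
As a concrete illustration of this shared-subspace idea, here is a minimal sketch in Python/NumPy that retrieves the top-k images for each text query by cosine similarity. The random arrays stand in for features that learned encoders (not shown) have already projected into one shared space:

```python
# Minimal sketch of shared-subspace cross-modal retrieval.
# Random arrays stand in for image/text features that learned
# encoders (not shown) have already projected into one d-dim space.
import numpy as np

def cosine_similarity(queries, gallery):
    """Pairwise cosine similarity between query and gallery embeddings."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return q @ g.T

def retrieve(text_embs, image_embs, k=5):
    """Indices of the top-k gallery images for each text query."""
    sims = cosine_similarity(text_embs, image_embs)
    return np.argsort(-sims, axis=1)[:, :k]

rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 256))     # 3 text queries, 256-dim shared space
image_embs = rng.normal(size=(100, 256))  # gallery of 100 images
print(retrieve(text_embs, image_embs))
```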

Most implemented papers

A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language

bingsu12/momu 12 Sep 2022

Although artificial intelligence (AI) has made significant progress in understanding molecules across a wide range of fields, existing models generally acquire a single cognitive ability from a single molecular modality.

Deep Visual-Semantic Alignments for Generating Image Descriptions

VinitSR7/Image-Caption-Generation CVPR 2015

Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.
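
The paper scores an image-sentence pair by matching each word to its best-scoring image region and summing those maxima. A rough sketch of that alignment score, with random arrays standing in for the learned region (CNN) and word (RNN) embeddings:

```python
# Rough sketch of the paper's image-sentence alignment score: each
# word is matched to its best-scoring image region (max dot product)
# and the maxima are summed. Random arrays stand in for the learned
# region (CNN) and word (RNN) embeddings.
import numpy as np

def alignment_score(region_embs, word_embs):
    """Sum over words of the best dot product with any image region."""
    sims = word_embs @ region_embs.T      # (n_words, n_regions)
    return sims.max(axis=1).sum()

rng = np.random.default_rng(0)
regions = rng.normal(size=(19, 512))      # e.g. 19 detected image regions
words = rng.normal(size=(8, 512))         # an 8-word sentence
print(alignment_score(regions, words))
```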

FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval

alibaba/EasyNLP 20 May 2020

In this paper, we address the text and image matching in cross-modal retrieval of the fashion industry.

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

PaddlePaddle/PaddleNLP ACL 2021

Existing pre-training methods focus either on single-modal tasks or on multi-modal tasks, and cannot effectively adapt to each other.

IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

e-bug/volta 27 Jan 2022

Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

A Channel Mix Method for Fine-Grained Cross-Modal Retrieval

msfuxian/A_CHANNEL_MIX_METHOD ICME 2022

In this paper, we propose a simple but effective method for the challenging fine-grained cross-modal retrieval task, which aims to enable flexible retrieval among subordinate categories across different modalities.

Order-Embeddings of Images and Language

ivendrov/order-embedding 19 Nov 2015

Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images.
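
The paper replaces symmetric distances with an order-violation penalty E(x, y) = ||max(0, y − x)||², which is zero exactly when x dominates y coordinate-wise, so that more specific items (e.g. images) embed "below" more general ones (e.g. captions). A small sketch; the toy vectors are illustrative, not learned embeddings:

```python
# Sketch of the order-violation penalty from the paper:
# x ⪯ y holds iff every coordinate of x is >= the matching coordinate
# of y, and E(x, y) = ||max(0, y - x)||^2 is zero exactly in that case.
# The toy vectors below are illustrative, not learned embeddings.
import numpy as np

def order_violation(x, y):
    """E(x, y) = || max(0, y - x) ||^2 -- zero iff x precedes y."""
    return np.square(np.maximum(0.0, y - x)).sum()

image = np.array([2.0, 3.0, 1.0])    # more specific: lower in the hierarchy
caption = np.array([1.0, 2.0, 0.5])  # more general: dominated coordinate-wise
print(order_violation(image, caption))  # 0.0  -> ordering satisfied
print(order_violation(caption, image))  # 2.25 -> ordering violated
```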

Deep Cross-Modal Hashing

jiangqy/DCMH-CVPR2017 CVPR 2017

Due to its low storage cost and fast query speed, cross-modal hashing (CMH) has been widely used for similarity search in multimedia retrieval applications.
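
The speed and storage advantage comes from replacing floating-point embeddings with short binary codes compared via Hamming distance. A minimal sketch, with random codes standing in for the outputs of the learned image/text hashing networks:

```python
# Why binary codes are fast: retrieval reduces to Hamming distance,
# which is cheap to compute and store. Random codes stand in for the
# outputs of the learned image/text hashing networks.
import numpy as np

def hamming_distance(query_code, gallery_codes):
    """Number of differing bits between the query and each gallery code."""
    return np.count_nonzero(gallery_codes != query_code, axis=1)

rng = np.random.default_rng(0)
text_code = rng.integers(0, 2, size=64, dtype=np.uint8)            # 64-bit query
image_codes = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)  # gallery
nearest = np.argsort(hamming_distance(text_code, image_codes))[:5]
print(nearest)  # indices of the 5 closest images in Hamming space
```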

Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions

AlexMoreo/tensorflow-Tex2Vis 23 Jun 2016

We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation.
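
A minimal sketch of that pipeline: translate the text query into the visual feature space, then run a plain nearest-neighbor search there. The linear map W below is a stand-in for the paper's learned text-to-visual translation, and the feature shapes are illustrative assumptions:

```python
# Minimal sketch of the pipeline: translate the text query into the
# visual feature space, then do a plain nearest-neighbor search there.
# The linear map W is a stand-in for the paper's learned translation.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(300, 4096))               # hypothetical learned mapping
text_query = rng.normal(size=300)              # e.g. an embedded text query
visual_gallery = rng.normal(size=(500, 4096))  # e.g. CNN image features

query_as_visual = text_query @ W               # query now lives in visual space
dists = np.linalg.norm(visual_gallery - query_as_visual, axis=1)
print(np.argsort(dists)[:5])                   # 5 nearest images (Euclidean)
```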

Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint

csehong/VM-NET 22 Apr 2017

Up to now, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video or vice versa.