Cross-Modal Retrieval

190 papers with code • 12 benchmarks • 20 datasets

Cross-Modal Retrieval is the task of retrieving relevant items across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge is the modality gap: features from different modalities are not directly comparable. The key solution is to learn new representations for the different modalities in a shared subspace, so that the resulting features can be compared with standard distance metrics such as cosine distance or Euclidean distance.
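
As a concrete illustration of this shared-subspace idea, here is a minimal sketch in Python/NumPy that retrieves the top-k images for each text query by cosine similarity. The random arrays stand in for features that learned encoders (not shown) have already projected into one shared space:

```python
# Minimal sketch of shared-subspace cross-modal retrieval.
# Random arrays stand in for image/text features that learned
# encoders (not shown) have already projected into one d-dim space.
import numpy as np

def cosine_similarity(queries, gallery):
    """Pairwise cosine similarity between query and gallery embeddings."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return q @ g.T

def retrieve(text_embs, image_embs, k=5):
    """Indices of the top-k gallery images for each text query."""
    sims = cosine_similarity(text_embs, image_embs)
    return np.argsort(-sims, axis=1)[:, :k]

rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 256))     # 3 text queries, 256-dim shared space
image_embs = rng.normal(size=(100, 256))  # gallery of 100 images
print(retrieve(text_embs, image_embs))
```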

Most implemented papers

A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language

bingsu12/momu 12 Sep 2022

Although artificial intelligence (AI) has made significant progress in understanding molecules across a wide range of fields, existing models generally acquire a single cognitive ability from a single molecular modality.

Deep Visual-Semantic Alignments for Generating Image Descriptions

VinitSR7/Image-Caption-Generation CVPR 2015

Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.
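
The paper scores an image-sentence pair by matching each word to its best-scoring image region and summing those maxima. A rough sketch of that alignment score, with random arrays standing in for the learned region (CNN) and word (RNN) embeddings:

```python
# Rough sketch of the paper's image-sentence alignment score: each
# word is matched to its best-scoring image region (max dot product)
# and the maxima are summed. Random arrays stand in for the learned
# region (CNN) and word (RNN) embeddings.
import numpy as np

def alignment_score(region_embs, word_embs):
    """Sum over words of the best dot product with any image region."""
    sims = word_embs @ region_embs.T      # (n_words, n_regions)
    return sims.max(axis=1).sum()

rng = np.random.default_rng(0)
regions = rng.normal(size=(19, 512))      # e.g. 19 detected image regions
words = rng.normal(size=(8, 512))         # an 8-word sentence
print(alignment_score(regions, words))
```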

FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval

alibaba/EasyNLP 20 May 2020

In this paper, we address the text and image matching in cross-modal retrieval of the fashion industry.

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

PaddlePaddle/PaddleNLP ACL 2021

Existing pre-training methods focus either on single-modal tasks or on multi-modal tasks, and cannot effectively adapt to each other.

IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

e-bug/volta 27 Jan 2022

Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

A Channel Mix Method for Fine-Grained Cross-Modal Retrieval

msfuxian/A_CHANNEL_MIX_METHOD ICME 2022

In this paper, we propose a simple but effective method for the challenging fine-grained cross-modal retrieval task, which aims to enable flexible retrieval among subordinate categories across different modalities.

Order-Embeddings of Images and Language

ivendrov/order-embedding 19 Nov 2015

Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images.
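
The paper replaces symmetric distances with an order-violation penalty E(x, y) = ||max(0, y − x)||², which is zero exactly when x dominates y coordinate-wise, so that more specific items (e.g. images) embed "below" more general ones (e.g. captions). A small sketch; the toy vectors are illustrative, not learned embeddings:

```python
# Sketch of the order-violation penalty from the paper:
# x ⪯ y holds iff every coordinate of x is >= the matching coordinate
# of y, and E(x, y) = ||max(0, y - x)||^2 is zero exactly in that case.
# The toy vectors below are illustrative, not learned embeddings.
import numpy as np

def order_violation(x, y):
    """E(x, y) = || max(0, y - x) ||^2 -- zero iff x precedes y."""
    return np.square(np.maximum(0.0, y - x)).sum()

image = np.array([2.0, 3.0, 1.0])    # more specific: lower in the hierarchy
caption = np.array([1.0, 2.0, 0.5])  # more general: dominated coordinate-wise
print(order_violation(image, caption))  # 0.0  -> ordering satisfied
print(order_violation(caption, image))  # 2.25 -> ordering violated
```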

Deep Cross-Modal Hashing

jiangqy/DCMH-CVPR2017 CVPR 2017

Due to its low storage cost and fast query speed, cross-modal hashing (CMH) has been widely used for similarity search in multimedia retrieval applications.
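
The speed and storage advantage comes from replacing floating-point embeddings with short binary codes compared via Hamming distance. A minimal sketch, with random codes standing in for the outputs of the learned image/text hashing networks:

```python
# Why binary codes are fast: retrieval reduces to Hamming distance,
# which is cheap to compute and store. Random codes stand in for the
# outputs of the learned image/text hashing networks.
import numpy as np

def hamming_distance(query_code, gallery_codes):
    """Number of differing bits between the query and each gallery code."""
    return np.count_nonzero(gallery_codes != query_code, axis=1)

rng = np.random.default_rng(0)
text_code = rng.integers(0, 2, size=64, dtype=np.uint8)            # 64-bit query
image_codes = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)  # gallery
nearest = np.argsort(hamming_distance(text_code, image_codes))[:5]
print(nearest)  # indices of the 5 closest images in Hamming space
```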

Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions

AlexMoreo/tensorflow-Tex2Vis 23 Jun 2016

We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation.
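
A minimal sketch of that pipeline: translate the text query into the visual feature space, then run a plain nearest-neighbor search there. The linear map W below is a stand-in for the paper's learned text-to-visual translation, and the feature shapes are illustrative assumptions:

```python
# Minimal sketch of the pipeline: translate the text query into the
# visual feature space, then do a plain nearest-neighbor search there.
# The linear map W is a stand-in for the paper's learned translation.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(300, 4096))               # hypothetical learned mapping
text_query = rng.normal(size=300)              # e.g. an embedded text query
visual_gallery = rng.normal(size=(500, 4096))  # e.g. CNN image features

query_as_visual = text_query @ W               # query now lives in visual space
dists = np.linalg.norm(visual_gallery - query_as_visual, axis=1)
print(np.argsort(dists)[:5])                   # 5 nearest images (Euclidean)
```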

Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint

csehong/VM-NET 22 Apr 2017

Up to now, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video or vice versa.