About

Cross-Modal Retrieval is the task of retrieving data across different modalities, such as image-text, video-text, and audio-text retrieval. Its main challenge is the modality gap; the key solution is to map the different modalities into new representations in a shared subspace, such that the generated features can be compared with standard distance metrics, such as cosine distance and Euclidean distance.
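The distance computation over shared-subspace features can be sketched as follows; the toy embeddings and the three-dimensional subspace are illustrative assumptions, not taken from any particular model:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(a - b))

# Toy features already projected into a shared subspace
# (in practice they come from modality-specific encoders).
image_emb = np.array([0.9, 0.1, 0.0])
text_emb  = np.array([0.8, 0.2, 0.1])

sim = cosine_similarity(image_emb, text_emb)    # high: likely a match
dist = euclidean_distance(image_emb, text_emb)  # low: likely a match
```

Retrieval then amounts to ranking all candidates from one modality by this score against a query from the other modality.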

Source: Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval

Greatest papers with code

FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval

20 May 2020 alibaba/EasyTransfer

In this paper, we address text and image matching in cross-modal retrieval for the fashion industry.

CROSS-MODAL RETRIEVAL

Target-Oriented Deformation of Visual-Semantic Embedding Space

15 Oct 2019 fartashf/vsepp

Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation.

CROSS-MODAL RETRIEVAL

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

18 Jul 2017 fartashf/vsepp

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval.

CROSS-MODAL RETRIEVAL IMAGE RETRIEVAL STRUCTURED PREDICTION
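A minimal sketch of the hard-negative (max-violation) triplet objective popularized by VSE++, assuming a precomputed image-caption similarity matrix; the margin value and matrix shapes here are illustrative:

```python
import numpy as np

def vsepp_loss(sim, margin=0.2):
    """
    Max-violation triplet loss in the spirit of VSE++.
    sim is an (N, N) image-caption similarity matrix where
    sim[i, i] is the score of the matching (positive) pair.
    For each positive pair, only the hardest negative is penalized.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                    # scores of matching pairs
    mask = np.eye(n, dtype=bool)

    # caption retrieval: violations by negative captions per image
    cost_c = np.clip(margin + sim - pos[:, None], 0, None)
    cost_c[mask] = 0
    # image retrieval: violations by negative images per caption
    cost_i = np.clip(margin + sim - pos[None, :], 0, None)
    cost_i[mask] = 0

    # keep only the hardest negative in each direction
    return float(cost_c.max(axis=1).sum() + cost_i.max(axis=0).sum())
```

When the positive pairs are separated from all negatives by more than the margin, the loss is zero; otherwise only the single worst offender per query contributes, which is the paper's key change over summing all violations.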

Stacked Cross Attention for Image-Text Matching

ECCV 2018 kuanghuei/SCAN

Prior work either simply aggregates the similarity of all possible pairs of regions and words without attending differentially to more and less important words or regions, or uses a multi-step attentional process that captures only a limited number of semantic alignments and is less interpretable.

CROSS-MODAL RETRIEVAL IMAGE RETRIEVAL TEXT MATCHING
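As a contrast to both prior approaches, a much-simplified sketch of word-to-region cross attention in the spirit of SCAN; the temperature value, averaging-based pooling, and toy features are assumptions, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_similarity(regions, words, temperature=9.0):
    """
    Simplified text-to-image cross attention: each word attends
    over image regions, and the image-sentence score averages the
    word/attended-region cosine similarities.
    regions: (R, D) region features; words: (W, D) word features.
    """
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = w @ r.T                                # (W, R) word-region cosines
    attn = softmax(temperature * sim, axis=1)    # word -> region weights
    attended = attn @ regions                    # (W, D) per-word image context
    a = attended / np.linalg.norm(attended, axis=1, keepdims=True)
    return float(np.sum(w * a, axis=1).mean())   # pooled cosine score
```

The attention weights make the alignment inspectable: for each word, `attn` shows which regions it matched, which is the interpretability the abstract contrasts against multi-step attention.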

Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations

CVPR 2019 vacancy/SceneGraphParser

We propose the Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representation and textual semantics.

CROSS-MODAL RETRIEVAL

Visual Semantic Reasoning for Image-Text Matching

ICCV 2019 KunpengLi1994/VSRN

It outperforms the current best method by 6.8% relatively for image retrieval and 4.8% relatively for caption retrieval on MS-COCO (Recall@1 using the 1K test set).

CROSS-MODAL RETRIEVAL IMAGE RETRIEVAL TEXT MATCHING

Order-Embeddings of Images and Language

19 Nov 2015 ivendrov/order-embedding

Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images.

CROSS-MODAL RETRIEVAL IMAGE CAPTIONING NATURAL LANGUAGE INFERENCE
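The visual-semantic hierarchy idea can be illustrated with the order-violation penalty E(x, y) = ||max(0, y - x)||^2 from the order-embeddings formulation; the 2-D embeddings below are made-up examples, not learned vectors:

```python
import numpy as np

def order_violation(x, y):
    """
    Order-embedding penalty E(x, y) = || max(0, y - x) ||^2.
    Zero iff y <= x coordinate-wise, i.e. the hypothesized
    partial order holds (y is the "more general" concept).
    """
    return float(np.sum(np.maximum(0.0, y - x) ** 2))

# Hypothetical embeddings: a hyponym should dominate its hypernym
dog    = np.array([2.0, 3.0])
animal = np.array([1.0, 1.5])

holds    = order_violation(dog, animal)  # 0.0: "dog is an animal" holds
violated = order_violation(animal, dog)  # > 0: reversed order is penalized
```

Because the penalty is asymmetric, the same score can rank entailment, hypernymy, and caption-image pairs along one shared hierarchy, which is the unification the abstract describes.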

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

NeurIPS 2020 gingsi/coot-videotext

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics.

CROSS-MODAL RETRIEVAL REPRESENTATION LEARNING VIDEO-TEXT RETRIEVAL

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

CVPR 2020 cshizhe/hgr_v2t

To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.

CROSS-MODAL RETRIEVAL TEXT MATCHING VIDEO-TEXT RETRIEVAL