Cross-Modal Retrieval

80 papers with code • 3 benchmarks • 12 datasets

Cross-Modal Retrieval implements a retrieval task across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge of Cross-Modal Retrieval is the modality gap, and the key solution is to map the different modalities into a shared subspace, such that the resulting features can be compared with common distance metrics such as cosine distance and Euclidean distance.

Source: Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval
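Once both modalities are embedded in the shared subspace, retrieval reduces to a nearest-neighbor search under a distance metric. The sketch below illustrates this with cosine similarity, assuming the image and text encoders (not shown) have already produced embeddings of the same dimensionality; the function names are illustrative, not from any specific paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve(query_emb, gallery_embs, k=5):
    """Return indices of the k gallery items most similar to the query.

    Assumes both modalities were already projected into a shared
    embedding space (e.g. a text query against an image gallery).
    """
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    sims = g @ q  # cosine similarity of every gallery item to the query
    return np.argsort(-sims)[:k]

# Toy usage: the query points in the same direction as gallery item 3,
# so item 3 is returned first.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(10, 4))
query = 2.0 * gallery[3]
top_k = retrieve(query, gallery, k=3)
```

Because the vectors are L2-normalized first, ranking by dot product is equivalent to ranking by cosine similarity (and, for unit vectors, by Euclidean distance).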

Greatest papers with code

Stacked Capsule Autoencoders

google-research/google-research NeurIPS 2019

In the second stage, SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses.

Cross-Modal Retrieval Unsupervised MNIST

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

PaddlePaddle/PaddleNLP ACL 2021

Existing pre-training methods focus on either single-modal or multi-modal tasks, and cannot effectively adapt to each other.

Contrastive Learning Cross-Modal Retrieval

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

yehli/xmodaler 18 Aug 2021

Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.

Cross-Modal Retrieval Image Captioning +4

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

microsoft/Oscar ECCV 2020

Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks.

Cross-Modal Retrieval Image Captioning +3

FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval

alibaba/EasyTransfer 20 May 2020

In this paper, we address the text and image matching in cross-modal retrieval of the fashion industry.

Cross-Modal Retrieval

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

dandelin/vilt 5 Feb 2021

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.

Cross-Modal Retrieval Visual Question Answering +2

Target-Oriented Deformation of Visual-Semantic Embedding Space

fartashf/vsepp 15 Oct 2019

Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation.

Cross-Modal Retrieval Translation

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

fartashf/vsepp 18 Jul 2017

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval.

Cross-Modal Retrieval Fine-tuning +2
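VSE++'s key idea is to replace the usual sum over all negatives in the triplet ranking loss with only the hardest negative in the batch. A minimal NumPy sketch of such a max-of-hinges loss is shown below; it is an illustration of the loss shape, not the authors' implementation (a real training loop would compute this with an autodiff framework to get gradients).

```python
import numpy as np

def hard_negative_triplet_loss(img, txt, margin=0.2):
    """Max-of-hinges triplet ranking loss over a batch, in the spirit of VSE++.

    img, txt: L2-normalized embeddings of matched pairs, shape (batch, dim);
    row i of img and row i of txt form a positive image-text pair.
    """
    sims = img @ txt.T            # (batch, batch) cosine similarity matrix
    pos = np.diag(sims)           # similarities of the matched pairs
    n = sims.shape[0]
    diag = np.eye(n, dtype=bool)

    # Hinge cost of each negative, for image->text and text->image retrieval.
    cost_i2t = np.maximum(0.0, margin + sims - pos[:, None])
    cost_t2i = np.maximum(0.0, margin + sims - pos[None, :])
    cost_i2t[diag] = 0.0          # exclude the positive pair itself
    cost_t2i[diag] = 0.0

    # VSE++: penalize only the hardest (highest-cost) negative per query,
    # instead of summing over all negatives.
    return (cost_i2t.max(axis=1) + cost_t2i.max(axis=0)).mean()

# With perfectly aligned one-hot embeddings, every negative is outside the
# margin, so the loss is zero.
aligned = np.eye(4)
loss = hard_negative_triplet_loss(aligned, aligned, margin=0.2)
```

Focusing the gradient on the hardest negative is what the paper credits for its improvement over the sum-over-negatives baseline.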

Stacked Cross Attention for Image-Text Matching

kuanghuei/SCAN ECCV 2018

Prior work either simply aggregates the similarity of all possible pairs of regions and words, without attending differentially to more and less important words or regions, or uses a multi-step attentional process to capture a limited number of semantic alignments, which is less interpretable.

Cross-Modal Retrieval Image Retrieval +1

Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations

vacancy/SceneGraphParser CVPR 2019

We propose the Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representation and textual semantics.

Contrastive Learning Cross-Modal Retrieval