Cross-Modal Retrieval

116 papers with code • 5 benchmarks • 15 datasets

Cross-Modal Retrieval is the task of retrieving items across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge is the modality gap. The key solution is to map the different modalities into new representations in a shared subspace, so that the resulting features can be compared directly with distance metrics such as cosine distance and Euclidean distance.

Source: Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval
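
As a rough illustration of the shared-subspace idea described above, the sketch below ranks a few image embeddings against a text query using cosine distance or Euclidean distance. It assumes the embeddings have already been projected into a common space by some encoder. The vectors, the ids (`img_dog`, `img_car`, `img_cat`), and the helper names are hypothetical and are not taken from any of the listed papers.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two points in the shared space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_candidates(query, candidates, distance=cosine_distance):
    """Return candidate ids sorted from nearest to farthest from the query."""
    return sorted(candidates, key=lambda cid: distance(query, candidates[cid]))


# Hypothetical embeddings, assumed to live in the same shared subspace.
text_query = [0.9, 0.1, 0.0]
image_embeddings = {
    "img_dog": [0.8, 0.2, 0.1],
    "img_car": [0.0, 0.1, 0.9],
    "img_cat": [0.5, 0.5, 0.0],
}

ranking = rank_candidates(text_query, image_embeddings)
ranking_l2 = rank_candidates(text_query, image_embeddings, euclidean_distance)
print(ranking)
print(ranking_l2)
```

Because both modalities share one space, the same function serves text-to-image and image-to-text retrieval: only the roles of query and candidates are swapped. Real systems typically learn the projection and use an approximate nearest-neighbour index rather than a full sort.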



Most implemented papers

Stacked Capsule Autoencoders

google-research/google-research NeurIPS 2019

In the second stage, SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses.

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

fartashf/vsepp 18 Jul 2017

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval.

Rescaling Egocentric Vision

epic-kitchens/epic-kitchens-100-annotations 23 Jun 2020

This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS.

Stacked Cross Attention for Image-Text Matching

kuanghuei/SCAN ECCV 2018

Prior work either simply aggregates the similarity of all possible pairs of regions and words without attending differentially to more and less important words or regions, or uses a multi-step attentional process to capture a limited number of semantic alignments, which is less interpretable.

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

dandelin/vilt 5 Feb 2021

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

cshizhe/hgr_v2t CVPR 2020

To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

microsoft/Oscar ECCV 2020

Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks.

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

salesforce/lavis NeurIPS 2021

Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.

Deep Visual-Semantic Alignments for Generating Image Descriptions

VinitSR7/Image-Caption-Generation CVPR 2015

Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.

FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval

alibaba/EasyNLP 20 May 2020

In this paper, we address the text and image matching in cross-modal retrieval of the fashion industry.