Cross-Modal Retrieval is the task of retrieving relevant items across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge of Cross-Modal Retrieval is the modality gap, and the key solution is to generate new representations from the different modalities in a shared subspace, so that the newly generated features can be compared with standard distance metrics such as cosine distance and Euclidean distance.
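As a minimal illustration of the final matching step (assuming hypothetical embeddings that have already been projected into the shared space), both metrics reduce to simple vector operations:

```python
import numpy as np

def cosine_similarity(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between two shared-space embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(np.dot(image_emb, text_emb))

def euclidean_distance(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Euclidean distance between two shared-space embeddings."""
    return float(np.linalg.norm(image_emb - text_emb))

# Toy example: 4-dimensional embeddings projected from an image and a caption.
img = np.array([0.2, 0.5, 0.1, 0.8])
txt = np.array([0.1, 0.6, 0.0, 0.7])
print(cosine_similarity(img, txt), euclidean_distance(img, txt))
```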
In this paper, we address text-image matching in cross-modal retrieval for the fashion industry.
Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation.
Prior work either simply aggregates the similarity of all possible pairs of regions and words without attending differentially to more and less important words or regions, or uses a multi-step attentional process to capture a limited number of semantic alignments, which is less interpretable.
Ranked #5 on Cross-Modal Retrieval on Flickr30k
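A rough sketch of the general idea of attending over region-word similarities rather than pooling all pairs uniformly is below; the softmax weighting, temperature value, and mean aggregation are illustrative assumptions, not the exact mechanism of any listed paper.

```python
import torch
import torch.nn.functional as F

def attended_similarity(regions: torch.Tensor, words: torch.Tensor,
                        temperature: float = 9.0) -> torch.Tensor:
    """Image-sentence score from region-word similarities.

    regions: (R, D) region features, words: (W, D) word features.
    Each word attends over image regions, is compared to its attended
    visual context, and the per-word scores are averaged.
    """
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    sim = words @ regions.t()                    # (W, R) word-region cosine similarities
    attn = F.softmax(temperature * sim, dim=-1)  # attention of each word over regions
    context = attn @ regions                     # (W, D) attended visual context per word
    word_scores = F.cosine_similarity(words, context, dim=-1)  # (W,)
    return word_scores.mean()

# Toy usage with random features (5 regions, 3 words, 16-d).
score = attended_similarity(torch.randn(5, 16), torch.randn(3, 16))
```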
In this paper, we propose a new system that discriminatively embeds images and text into a shared visual-textual space.
Ranked #1 on NLP based Person Retrieval on CUHK-PEDES (R@1 metric)
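A common way to train such a discriminative shared space is a hinge-based bidirectional triplet ranking loss with in-batch negatives; the sketch below is a generic version under that assumption (the margin value is illustrative), not the specific objective of the paper above.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                         margin: float = 0.2) -> torch.Tensor:
    """Hinge-based bidirectional ranking loss over a batch of matched pairs.

    img_emb, txt_emb: (B, D) embeddings where row i of each is a matched pair.
    Every other row in the batch serves as a negative.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    scores = img_emb @ txt_emb.t()            # (B, B) similarity matrix
    pos = scores.diag().unsqueeze(1)          # (B, 1) matched-pair scores
    # Hinge costs for image->text and text->image retrieval directions.
    cost_i2t = (margin + scores - pos).clamp(min=0)
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)
    # Do not penalize the positive pairs themselves.
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()
```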
We propose the Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representation and textual semantics.
It outperforms the current best method by a relative 6.8% for image retrieval and 4.8% for caption retrieval on MS-COCO (Recall@1 using the 1K test set).
Ranked #4 on Cross-Modal Retrieval on COCO 2014
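Recall@1, used in the result above, is the standard retrieval metric on these benchmarks; a minimal sketch of how Recall@K is typically computed from a query-candidate similarity matrix follows (illustrative only, not the evaluation code of the paper).

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int = 1) -> float:
    """Recall@K for retrieval.

    similarity: (N, N) matrix where entry (i, j) scores query i against
    candidate j, and the ground-truth match of query i is candidate i.
    """
    n = similarity.shape[0]
    # Indices of the top-k candidates for each query, highest score first.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = sum(1 for i in range(n) if i in top_k[i])
    return hits / n

# Toy example: 3 queries; the correct candidate ranks first for two of them.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.0, 0.7]])
print(recall_at_k(sim, k=1))  # 2/3
```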
Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images.
Ranked #85 on Natural Language Inference on SNLI
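The hierarchy view above is typically operationalized with an order-violation penalty over embeddings. The sketch below assumes the convention that the more general concept should lie coordinate-wise below the more specific one in a non-negative embedding space; it illustrates the idea rather than reproducing the paper's exact formulation.

```python
import torch

def order_violation(general: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
    """Penalty for violating the partial order general <= specific (coordinate-wise).

    Zero when every coordinate of `general` is <= the matching coordinate of
    `specific`; grows with the amount of violation otherwise.
    """
    return torch.clamp(general - specific, min=0).pow(2).sum(dim=-1)

# Toy example: "animal" should sit below "dog" in every coordinate.
animal = torch.tensor([0.1, 0.2, 0.0])
dog = torch.tensor([0.4, 0.3, 0.2])
print(order_violation(animal, dog))  # tensor(0.) -- order respected
print(order_violation(dog, animal))  # positive -- order violated
```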
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics.
Ranked #1 on Video Captioning on ActivityNet Captions
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.
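As a minimal sketch of combining matching scores at several granularities (video/clip/frame against paragraph/sentence/word), the weighted average below is an illustrative assumption and not the graph-reasoning mechanism of HGR.

```python
import torch
import torch.nn.functional as F

def hierarchical_score(video_levels, text_levels,
                       weights=(1 / 3, 1 / 3, 1 / 3)) -> torch.Tensor:
    """Combine similarity scores computed at several granularity levels.

    video_levels / text_levels: lists of (N_i, D) feature tensors, one entry per
    level (e.g. video/clip/frame vs. paragraph/sentence/word). Each level's
    score is the mean, over text units, of the best-matching video unit.
    """
    total = 0.0
    for v, t, w in zip(video_levels, text_levels, weights):
        v = F.normalize(v, dim=-1)
        t = F.normalize(t, dim=-1)
        sim = t @ v.t()                               # (N_text, N_video) cosine similarities
        level_score = sim.max(dim=-1).values.mean()   # best video unit per text unit
        total = total + w * level_score
    return total

# Toy usage: 1 video / 4 clips / 12 frames vs. 1 paragraph / 3 sentences / 10 words.
video = [torch.randn(1, 32), torch.randn(4, 32), torch.randn(12, 32)]
text = [torch.randn(1, 32), torch.randn(3, 32), torch.randn(10, 32)]
score = hierarchical_score(video, text)
```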