In this paper, we propose a new system to discriminatively embed images and text into a shared visual-textual space.
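As an illustration of the shared visual-textual space idea, a minimal PyTorch sketch is given below. It is not the paper's specific discriminative architecture or loss; the feature dimensions, layer sizes, and the hinge-based in-batch ranking loss are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceEmbedding(nn.Module):
    """Illustrative two-branch model mapping image and text features into one
    shared space (a sketch, not the paper's exact design)."""
    def __init__(self, img_dim=2048, txt_dim=300, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Sequential(nn.Linear(img_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, embed_dim))
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, embed_dim))

    def forward(self, img_feat, txt_feat):
        # L2-normalize so cosine similarity reduces to a dot product.
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t

def triplet_ranking_loss(v, t, margin=0.2):
    """Hinge ranking loss over in-batch negatives (a common choice, assumed here)."""
    sim = v @ t.t()                                   # (batch, batch) similarities
    pos = sim.diag().unsqueeze(1)                     # matched image-text pairs
    cost_t = (margin + sim - pos).clamp(min=0)        # image -> wrong text
    cost_v = (margin + sim - pos.t()).clamp(min=0)    # text -> wrong image
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_t.masked_fill(mask, 0).mean() + cost_v.masked_fill(mask, 0).mean()
```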
In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.
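A rough sketch of how such a polysemous module could be assembled from the one-line description follows; the actual PIE-Net has its own attention and fusion details, so the learned per-head queries, dimensions, and scaling used here are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolysemousEmbedding(nn.Module):
    """Sketch of a PIE-Net-style module (assumed reading of the description):
    K attention heads attend over local features; each head's output is fused
    with the global feature through a residual connection, yielding K diverse
    embeddings per instance."""
    def __init__(self, feat_dim=1024, embed_dim=512, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        # one learned query per head selects different local evidence
        self.queries = nn.Parameter(torch.randn(num_heads, feat_dim))
        self.local_proj = nn.Linear(feat_dim, embed_dim)
        self.global_proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, global_feat, local_feats):
        # local_feats: (batch, num_regions_or_tokens, feat_dim)
        attn = torch.einsum('kd,bnd->bkn', self.queries, local_feats)
        attn = F.softmax(attn / local_feats.size(-1) ** 0.5, dim=-1)
        attended = torch.einsum('bkn,bnd->bkd', attn, local_feats)
        local = self.local_proj(attended)                   # (batch, K, embed_dim)
        glob = self.global_proj(global_feat).unsqueeze(1)   # (batch, 1, embed_dim)
        # residual learning: each head refines the shared global embedding
        return F.normalize(glob + local, dim=-1)            # (batch, K, embed_dim)
```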
Up to now, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video or vice versa.
Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and a healthy lifestyle.
We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation.
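The following sketch illustrates this idea of regressing a textual query into the visual feature space and retrieving by nearest-neighbour search there; the network shape, dimensions, and the cosine-similarity search routine are assumptions for the example, not the paper's exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToVisual(nn.Module):
    """Sketch: map a textual query directly into the visual feature space,
    so retrieval becomes similarity search over precomputed visual features."""
    def __init__(self, txt_dim=300, vis_dim=2048, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, vis_dim))

    def forward(self, txt_feat):
        return F.normalize(self.net(txt_feat), dim=-1)

def search(query_txt_feat, model, visual_index, top_k=10):
    """Cosine-similarity search in the visual feature space."""
    with torch.no_grad():
        q = model(query_txt_feat)                  # (1, vis_dim)
        index = F.normalize(visual_index, dim=-1)  # (num_items, vis_dim)
        scores = (q @ index.t()).squeeze(0)
        return scores.topk(top_k).indices
```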
Designing powerful tools that support cooking activities has rapidly gained popularity due to the massive amounts of available data, as well as recent advances in machine learning that make it possible to analyze such data.
Effectively measuring the similarity between different modalities of data is the key to cross-modal retrieval.
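For concreteness, the snippet below shows one common way such similarity is measured and evaluated: cosine similarity between embeddings that already live in a shared space, scored with Recall@K. This is a generic illustration, not any single paper's protocol; the assumption that the i-th text matches the i-th image is made only for the example.

```python
import torch
import torch.nn.functional as F

def cross_modal_similarity(img_emb, txt_emb):
    """Cosine similarity between every image and every text embedding,
    assuming both already live in a shared space."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    return img @ txt.t()                 # (num_images, num_texts)

def recall_at_k(sim, k=10):
    """Recall@K for image-to-text retrieval, assuming the i-th text is the
    ground-truth match for the i-th image."""
    ranks = sim.argsort(dim=1, descending=True)
    target = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    return (ranks[:, :k] == target).any(dim=1).float().mean().item()
```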