|TREND||DATASET||BEST METHOD||PAPER TITLE||PAPER||CODE||COMPARE|
In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.
#2 best model for Cross-Modal Retrieval on COCO 2014
Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and healthy lifestyle.
Up to now, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video or vice versa.
Designing powerful tools that support cooking activities has rapidly gained popularity due to the massive amounts of available data, as well as recent advances in machine learning that are capable of analyzing them.
We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation.
We capitalize on large amounts of readily-available, synchronous data to learn a deep discriminative representations shared across three major natural modalities: vision, sound and language.