Finding beans in burgers: Deep semantic-visual embedding with localization

Several works have proposed to learn a two-path neural network that maps images and texts, respectively, into the same shared Euclidean space, where geometry captures useful semantic relationships. Such a multi-modal embedding can be trained and used for various tasks, notably image captioning…
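The two-path idea above can be sketched minimally as follows. This is an illustrative toy, not the paper's architecture: the feature and embedding dimensions, the single linear projection per path, and the random (untrained) weights are all assumptions made for the sketch. Each modality is projected into a shared space and L2-normalized, so a dot product between the two paths' outputs is a cosine similarity that training would shape to reflect semantics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: image features (2048-d), text features
# (300-d), and the shared embedding space (512-d).
IMG_DIM, TXT_DIM, EMB_DIM = 2048, 300, 512

# Each path is reduced here to one linear projection; in practice each
# would be a full encoder (CNN for images, RNN/transformer for text).
W_img = rng.standard_normal((IMG_DIM, EMB_DIM)) * 0.01
W_txt = rng.standard_normal((TXT_DIM, EMB_DIM)) * 0.01

def l2_normalize(x, axis=-1, eps=1e-12):
    # Project vectors onto the unit hypersphere of the shared space.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def embed_image(feats):
    return l2_normalize(feats @ W_img)

def embed_text(feats):
    return l2_normalize(feats @ W_txt)

# Toy batch of 4 image/caption feature pairs.
img_feats = rng.standard_normal((4, IMG_DIM))
txt_feats = rng.standard_normal((4, TXT_DIM))

img_emb = embed_image(img_feats)
txt_emb = embed_text(txt_feats)

# On the unit sphere, the dot product equals cosine similarity, so
# this 4x4 matrix compares every image to every caption in the batch.
similarity = img_emb @ txt_emb.T
print(similarity.shape)  # (4, 4)
```

A training objective (e.g. a triplet or contrastive ranking loss over this similarity matrix) would then pull matching image-caption pairs together and push mismatched pairs apart in the shared space.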
