Joint Learning of Distributed Representations for Images and Texts

13 Apr 2015 · Xiaodong He, Rupesh Srivastava, Jianfeng Gao, Li Deng

This technical report provides additional details on the deep multimodal similarity model (DMSM) proposed in (Fang et al. 2015, arXiv:1411.4952). The model is trained by maximizing the global semantic similarity between images and their natural-language captions on the public Microsoft COCO database, which consists of a large set of images and their corresponding captions. The learned representations attempt to capture the combination of various visual concepts and cues.
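
To make the training idea concrete, here is a minimal sketch of one common way to maximize similarity between an image and its caption: score each pair by cosine similarity, turn the scores into a softmax posterior over the matching caption and a few non-matching ones, and minimize the negative log posterior of the match. This is an illustration under stated assumptions, not the report's implementation. The function names, the scaling factor `gamma`, and the toy two-dimensional vectors are all hypothetical, and in a real model the vectors would come from learned image and text networks.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def caption_posterior(image_vec, caption_vecs, gamma=10.0):
    """Softmax over scaled cosine similarities.

    Assumption: caption_vecs[0] is the matching caption and the rest are
    sampled non-matching captions. gamma is a hypothetical smoothing factor.
    """
    scores = [gamma * cosine(image_vec, c) for c in caption_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_loss(image_vec, caption_vecs, gamma=10.0):
    """Negative log posterior of the matching caption (index 0)."""
    return -math.log(caption_posterior(image_vec, caption_vecs, gamma)[0])


# Toy example: the first caption vector points almost the same way as the image.
image = [1.0, 0.0]
captions = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
posterior = caption_posterior(image, captions)
loss = similarity_loss(image, captions)
```

Minimizing this loss pushes the matching caption's vector toward the image vector and the non-matching ones away from it, which is the sense in which such an objective maximizes similarity between images and their captions.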
