Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding

ICCV 2017  ·  Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, Gang Hua

We address the problem of dense visual-semantic embedding, which maps not only full sentences and whole images but also phrases within sentences and salient regions within images into a shared multimodal embedding space. As a result, we can produce several region-oriented, expressive phrases rather than just a single overview sentence to describe an image. In particular, we present a hierarchically structured recurrent neural network (RNN), namely the Hierarchical Multimodal LSTM (HM-LSTM). Unlike a chain-structured RNN, our model has a hierarchical structure, so it can naturally build representations for phrases and image regions and further exploit their hierarchical relations. Moreover, the fine-grained correspondences between phrases and image regions can be learned automatically and used to improve the learning of the dense embedding space. Extensive experiments on several datasets validate the efficacy of the proposed method, which compares favorably with state-of-the-art methods.
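
The paper does not release code, so the sketch below is only an illustration of the general idea described in the abstract: encode phrases with an LSTM, project region features into the same space, and train the joint embedding with a bidirectional ranking loss (a common choice for visual-semantic embeddings, not necessarily the exact objective of HM-LSTM). All module names, dimensions, and the margin value are assumptions.

```python
# Minimal sketch (not the authors' implementation) of a phrase/region
# visual-semantic embedding with a bidirectional hinge ranking loss.
# Dimensions, names, and the margin are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseEncoder(nn.Module):
    """Encode a (padded) phrase of word indices with an LSTM."""
    def __init__(self, vocab_size, word_dim=300, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, embed_dim, batch_first=True)

    def forward(self, word_ids):
        # word_ids: (batch, max_len) -> final hidden state as phrase embedding
        _, (h, _) = self.lstm(self.embed(word_ids))
        return F.normalize(h[-1], dim=-1)            # (batch, embed_dim)

class RegionEncoder(nn.Module):
    """Project pre-extracted region CNN features into the shared space."""
    def __init__(self, feat_dim=4096, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, region_feats):
        # region_feats: (batch, feat_dim)
        return F.normalize(self.proj(region_feats), dim=-1)

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge triplet loss over all in-batch negatives, in both directions."""
    scores = img_emb @ txt_emb.t()                   # cosine similarities
    diag = scores.diag().unsqueeze(1)
    cost_txt = (margin + scores - diag).clamp(min=0)       # region -> wrong phrase
    cost_img = (margin + scores - diag.t()).clamp(min=0)   # phrase -> wrong region
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

# Usage sketch: a batch of 8 matched (region, phrase) pairs
phrases = PhraseEncoder(vocab_size=10000)
regions = RegionEncoder()
loss = bidirectional_ranking_loss(
    regions(torch.randn(8, 4096)),
    phrases(torch.randint(0, 10000, (8, 6))),
)
loss.backward()
```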

No code implementations yet.

