Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning

When humans describe an image with a long paragraph, we usually first implicitly compose a mental “script” and then follow it while generating the paragraph. Inspired by this, we endow the modern encoder-decoder based image paragraph captioning model with such an ability by proposing the Hierarchical Scene Graph Encoder-Decoder (HSGED) for generating coherent and distinctive paragraphs. In particular, we use the image scene graph as the “script” to incorporate rich semantic knowledge and, more importantly, hierarchical constraints into the model. Specifically, we design a sentence scene graph RNN (SSG-RNN) to generate sub-graph-level topics, which constrain a word scene graph RNN (WSG-RNN) to generate the corresponding sentences. We propose irredundant attention in the SSG-RNN to raise the chance of abstracting topics from rarely described sub-graphs, and inheriting attention in the WSG-RNN to generate sentences better grounded in the abstracted topics; both give rise to more distinctive paragraphs. An efficient sentence-level loss is also proposed to encourage the sequence of generated sentences to resemble that of the ground-truth paragraphs. We validate HSGED on the Stanford image paragraph dataset and show that it not only achieves a new state-of-the-art 36.02 CIDEr-D score, but also generates more coherent and distinctive paragraphs under various metrics.
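The two-level decoding described in the abstract can be sketched as a minimal toy loop: a sentence-level RNN attends over scene-graph node features to form a topic, an irredundant-attention-style mask down-weights nodes that were already described, and a word-level RNN conditioned on that topic emits a sentence. This is an illustrative sketch only; the dimensions, the `attend` function, the masking rule, and the tanh update steps are assumptions, not the paper's actual SSG-RNN/WSG-RNN implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def attend(query, nodes, used_mask):
    # Toy stand-in for irredundant attention: mask out already-described nodes
    scores = nodes @ query
    scores = np.where(used_mask, -1e9, scores)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ nodes, weights

D = 8                                   # toy feature size
nodes = rng.normal(size=(5, D))         # toy scene-graph node embeddings
h_topic = np.zeros(D)                   # sentence-level (SSG-RNN-like) hidden state
used = np.zeros(5, dtype=bool)          # which sub-graphs were already described

paragraph = []
for _ in range(3):                      # three sentences
    ctx, w = attend(h_topic + 1.0, nodes, used)
    h_topic = np.tanh(h_topic + ctx)    # toy sentence-level RNN step
    used |= w > 0.5                     # mark the dominant sub-graph as described
    h_word, sent = np.copy(h_topic), []
    for _ in range(4):                  # four "words" per sentence
        h_word = np.tanh(h_word + h_topic)            # toy word-level step, topic-conditioned
        sent.append(int(np.argmax(nodes @ h_word)))   # toy "word" = index of best-matching node
    paragraph.append(sent)
```

Each sentence is driven by its own topic vector, so later sentences are pushed toward sub-graphs that earlier sentences did not cover, which is the intuition behind the irredundant attention described above.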

Task: Image Paragraph Captioning
Dataset: Image Paragraph Captioning (Stanford)
Model: HSGED(SLL)

Metric    Value   Global Rank
BLEU-4    11.26   #1
METEOR    18.33   #4
CIDEr     36.02   #1
