Look Deeper See Richer: Depth-aware Image Paragraph Captioning

With the widespread availability of image captioning at the sentence level, how to automatically generate image paragraphs is not yet well explored. Describing an image with a full paragraph involves organising sentences in an orderly, coherent and diverse manner, which inevitably leads to higher complexity than describing it with a single sentence. Existing image paragraph captioning methods produce a series of sentences to represent the objects and regions of interest, where the descriptions are essentially generated by feeding the image fragments containing those objects and regions into conventional single-sentence captioning models. This strategy makes it difficult to generate descriptions that preserve the stereoscopic hierarchy of the scene and avoid repeatedly describing overlapping objects. In this paper, we propose a Depth-aware Attention Model (DAM) to generate paragraph captions for images. The depths of image areas are first estimated in order to discriminate objects across a range of spatial locations, which can further guide the linguistic decoder to reveal spatial relationships among objects. This model completes the paragraph in a logical and coherent manner. By incorporating the attention mechanism, the learned model swiftly shifts the sentence focus during paragraph generation, whilst avoiding verbose descriptions of the same object. Extensive quantitative experiments and a user study have been conducted on the Visual Genome dataset, demonstrating the effectiveness and interpretability of the proposed model.
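The abstract only outlines the pipeline, so here is a minimal PyTorch sketch of one plausible reading: each region's estimated mean depth is embedded and fused with its visual feature before attention, and the attended context drives a sentence-level LSTM decoder. All module names, dimensions, and the fusion scheme are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch of depth-aware attention for paragraph decoding.
# Shapes, names, and the depth-fusion choice are assumptions, not DAM's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAwareAttention(nn.Module):
    """Attends over region features augmented with estimated region depths."""
    def __init__(self, feat_dim=2048, depth_dim=32, hidden_dim=512):
        super().__init__()
        self.depth_embed = nn.Linear(1, depth_dim)   # embed scalar mean depth
        self.feat_proj = nn.Linear(feat_dim + depth_dim, hidden_dim)
        self.state_proj = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, region_feats, region_depths, decoder_state):
        # region_feats:  (B, R, feat_dim)  visual features per region
        # region_depths: (B, R, 1)         mean estimated depth per region
        # decoder_state: (B, hidden_dim)   current LSTM hidden state
        depth = self.depth_embed(region_depths)                   # (B, R, depth_dim)
        fused = self.feat_proj(torch.cat([region_feats, depth], dim=-1))
        scores = self.score(torch.tanh(fused + self.state_proj(decoder_state).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)                           # attention over regions
        context = (alpha * fused).sum(dim=1)                       # (B, hidden_dim)
        return context, alpha

class ParagraphDecoder(nn.Module):
    """Word-by-word decoder; the depth-aware context shifts the sentence focus."""
    def __init__(self, vocab_size=10000, hidden_dim=512):
        super().__init__()
        self.attend = DepthAwareAttention(hidden_dim=hidden_dim)
        self.word_embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTMCell(2 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, state, region_feats, region_depths):
        h, c = state
        context, _ = self.attend(region_feats, region_depths, h)
        h, c = self.lstm(torch.cat([self.word_embed(word_ids), context], dim=-1), (h, c))
        return self.out(h), (h, c)
```

In this reading, region depths would come from an off-the-shelf monocular depth estimator, averaged over each region's bounding box; the attention weights then carry depth information, so spatially distant objects compete less for the same sentence.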

Task: Image Paragraph Captioning
Dataset: Image Paragraph Captioning
Model: Depth-aware Attention Model (DAM)

Metric    Value   Global Rank
BLEU-4    6.7     #9
METEOR    13.9    #9
CIDEr     17.3    #9
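For context on the metrics above: BLEU-4, METEOR and CIDEr for captioning are commonly computed with the pycocoevalcap toolkit. The snippet below is a generic illustration of that evaluation loop (the image ids and paragraph strings are placeholders), not the authors' evaluation script.

```python
# Generic captioning-metric computation with pycocoevalcap
# (pip install pycocoevalcap; METEOR additionally requires Java).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

# Both dicts map an image id to a list of tokenized paragraph strings.
refs = {"img1": ["a man stands near a red car . the car is parked on the street ."]}
hyps = {"img1": ["a man is standing next to a car ."]}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(refs, hyps)
    # Bleu(4) returns a list of BLEU-1..BLEU-4; the others return a single float.
    print(name, score)
```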
