Improving Diversity and Reducing Redundancy in Paragraph Captions

International Joint Conference on Neural Networks (IJCNN) 2020 · Kanani, Chandresh S., Sriparna Saha, and Pushpak Bhattacharyya

The purpose of an image paragraph captioning model is to produce detailed descriptions of the source images. Paragraph captioning models generally use encoder-decoder architectures similar to those of standard image captioning models: the encoder is a CNN-based model, and the decoder is an LSTM or GRU. Standard image captioning models produce unsatisfactory results on the paragraph captioning task because the generated outputs lack diversity [9]: the paragraphs they generate show little language diversity and contain redundant information. In this work, we propose an approach that uses a language discriminator to increase diversity in language and a dissimilarity score based on word mover’s distance [4] to reduce redundant information. Applying this approach to a state-of-the-art model at test time, we improve the METEOR score from 13.63 to 19.01 on the Visual Genome dataset.
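To make the idea of a distance-based dissimilarity score concrete, here is a minimal sketch of how redundant sentences might be filtered out of a generated paragraph. It is not the authors' implementation. It assumes a relaxed variant of word mover's distance (each word's mass moves to its nearest counterpart, which gives a lower bound on the full distance) rather than the exact optimal-transport formulation, and a greedy filter that the abstract does not describe. The embedding table `TOY_EMBEDDINGS`, the function names, and the threshold are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-D toy embeddings; a real system would use pretrained word vectors.
TOY_EMBEDDINGS = {
    "dog": (1.0, 0.0),
    "puppy": (0.9, 0.1),
    "grass": (0.0, 1.0),
    "lawn": (0.1, 0.9),
    "sky": (-1.0, 0.0),
    "blue": (-0.9, 0.2),
}


def _one_sided_cost(src, dst, emb):
    """Move each source word's normalized mass to its nearest destination word."""
    counts = Counter(src)
    total = sum(counts.values())
    targets = set(dst)
    cost = 0.0
    for word, count in counts.items():
        nearest = min(math.dist(emb[word], emb[other]) for other in targets)
        cost += (count / total) * nearest
    return cost


def relaxed_wmd(tokens_a, tokens_b, emb=TOY_EMBEDDINGS):
    """Relaxed word mover's distance: the larger of the two one-sided costs."""
    a = [w for w in tokens_a if w in emb]  # drop out-of-vocabulary tokens
    b = [w for w in tokens_b if w in emb]
    if not a or not b:
        raise ValueError("each sentence needs at least one known token")
    return max(_one_sided_cost(a, b, emb), _one_sided_cost(b, a, emb))


def filter_redundant(sentences, threshold, emb=TOY_EMBEDDINGS):
    """Greedy filter (an assumption): keep a tokenized sentence only if it is
    at least `threshold` away from every sentence kept so far."""
    kept = []
    for tokens in sentences:
        if all(relaxed_wmd(tokens, k, emb) >= threshold for k in kept):
            kept.append(tokens)
    return kept


paragraph = [["dog", "grass"], ["puppy", "lawn"], ["sky", "blue"]]
kept = filter_redundant(paragraph, threshold=0.5)
print(kept)
```

With these toy vectors, the second sentence sits very close to the first and is dropped, while the third is far from it and is kept. The threshold of 0.5 is arbitrary and would need tuning against real embeddings.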
