Improving Diversity and Reducing Redundancy in Paragraph Captions

The purpose of an image paragraph captioning model is to produce detailed descriptions of the source images. Paragraph captioning models generally use encoder-decoder architectures similar to those of standard image captioning models: the encoder is a CNN-based model, and the decoder is an LSTM or GRU. Standard image captioning models produce unsatisfactory results on the paragraph captioning task because their generated outputs lack diversity [9]; the paragraphs they generate lack language diversity and contain redundant information. In this work, we propose an approach that uses a language discriminator to increase language diversity and a dissimilarity score based on word mover’s distance [4] to reduce redundant information. Applying this approach to a state-of-the-art model at test time improves the METEOR score from 13.63 to 19.01 on the Visual Genome dataset.
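To make the idea of a dissimilarity score concrete, the following is a minimal sketch of one plausible way such a score could be used to filter redundant sentences. It is not the authors' implementation. It assumes the relaxed word mover's distance (a well-known lower bound on the exact distance in which each word simply moves to its nearest counterpart) in place of an exact optimal-transport solver, and the toy word vectors, the threshold, and the function names `relaxed_wmd` and `filter_redundant` are all hypothetical.

```python
import math

# Hypothetical 2-D word vectors, for illustration only. A real system would
# load pretrained embeddings (for example word2vec or GloVe).
TOY_VECTORS = {
    "man": (1.0, 0.0),
    "person": (0.9, 0.1),
    "wears": (0.0, 1.0),
    "has": (0.1, 0.9),
    "hat": (0.5, 0.5),
    "cap": (0.55, 0.45),
    "sky": (-1.0, 0.2),
    "blue": (-0.9, 0.4),
}


def _one_way_cost(src, dst, vectors):
    """Mean Euclidean distance from each word in src to its nearest word in dst."""
    total = sum(
        min(math.dist(vectors[w], vectors[v]) for v in dst) for w in src
    )
    return total / len(src)


def relaxed_wmd(sent_a, sent_b, vectors):
    """Relaxed word mover's distance: a lower bound on the exact WMD.

    Words missing from `vectors` are ignored. If either sentence has no
    known words, the distance is treated as infinite (maximally dissimilar).
    """
    a = [w for w in sent_a.lower().split() if w in vectors]
    b = [w for w in sent_b.lower().split() if w in vectors]
    if not a or not b:
        return float("inf")
    # Take the tighter of the two one-directional bounds.
    return max(_one_way_cost(a, b, vectors), _one_way_cost(b, a, vectors))


def filter_redundant(sentences, vectors, threshold):
    """Greedily keep a sentence only if it is dissimilar to every kept one.

    `threshold` is an assumed tuning parameter, not a value from the paper.
    """
    kept = []
    for sentence in sentences:
        if all(relaxed_wmd(sentence, k, vectors) >= threshold for k in kept):
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    paragraph = ["man wears hat", "person has cap", "sky is blue"]
    # The second sentence is close to the first in embedding space, so it
    # is dropped as redundant.
    print(filter_redundant(paragraph, TOY_VECTORS, threshold=0.5))
```

With these toy vectors, the first two sentences are near-paraphrases, so the second is dropped and the output is `['man wears hat', 'sky is blue']`. A greedy filter is only one design choice; the same score could equally be used as a penalty while ranking candidate sentences during decoding.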
