Generating Diverse and Descriptive Image Captions Using Visual Paraphrases

Recently there has been significant progress in image captioning with the help of deep learning. However, captions generated by current state-of-the-art models are still far from satisfactory, despite achieving high scores on conventional metrics such as BLEU and CIDEr. Human-written captions are diverse, informative and precise, whereas machine-generated captions tend to be simple, vague and dull. In this paper, aiming to improve the diversity and descriptiveness of generated image captions, we propose a model that exploits visual paraphrases (different sentences describing the same image) in captioning datasets. We explore different strategies for selecting useful visual paraphrase pairs for training by designing a variety of scoring functions. Our model consists of two decoding stages: a preliminary caption is generated in the first stage and then paraphrased into a more diverse and descriptive caption in the second stage. Extensive experiments are conducted on the benchmark MS COCO dataset, with automatic and human evaluation results verifying the effectiveness of our model.
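
The abstract does not specify the scoring functions used to select visual paraphrase pairs. As a minimal, illustrative sketch of the pair-selection idea only, the Python snippet below forms ordered (source, target) pairs from the human captions of one image and keeps pairs whose target looks more descriptive under a toy length-based score. The function names, the score itself, and the threshold are assumptions for illustration, not the paper's actual criteria.

```python
# Minimal sketch (not the authors' code): selecting visual paraphrase pairs
# for one image from a captioning dataset. The paper explores several
# scoring functions; the length-gain score here is only an illustrative stand-in.

from itertools import permutations


def length_gain_score(src: str, tgt: str) -> float:
    """Toy scoring function: reward pairs whose target caption is longer
    (a rough proxy for descriptiveness) than the source caption."""
    return len(tgt.split()) - len(src.split())


def select_paraphrase_pairs(captions, score_fn=length_gain_score, threshold=2.0):
    """Build ordered (source, target) paraphrase pairs for one image,
    keeping only pairs whose score meets the threshold."""
    pairs = []
    for src, tgt in permutations(captions, 2):
        if score_fn(src, tgt) >= threshold:
            pairs.append((src, tgt))
    return pairs


if __name__ == "__main__":
    # Five human captions for the same hypothetical MS COCO image.
    captions = [
        "a dog on a couch",
        "a brown dog sleeping on a couch",
        "a small brown dog curled up asleep on a gray living room couch",
        "dog resting indoors",
        "a dog lying on furniture",
    ]
    for src, tgt in select_paraphrase_pairs(captions):
        print(f"{src!r} -> {tgt!r}")
```

In a two-stage setup of the kind the abstract describes, such selected pairs would supervise the second decoding stage, with the source caption playing the role of the preliminary caption and the target serving as the more descriptive paraphrase.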
