GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints

Most image captioning models focus on single-image captioning, neglecting correlations such as relevance and diversity among group images (e.g., images within the same album or event), which results in less accurate and less diverse captions. Recent works mainly impose diversity during online inference only, neglecting the correlations among visual structures during offline training. In this paper, we propose a novel group-based image captioning scheme (termed GroupCap), which jointly models the structured relevance and diversity among the visual contents of group images towards optimal collaborative captioning. In particular, we first propose a visual tree parser (VP-Tree) to construct the structured semantic correlations within individual images. Then, the relevance and diversity among images are modeled by exploiting the correlations among their tree structures. Finally, these correlations are formulated as constraints and fed into an LSTM-based caption generator. In offline optimization, we adopt an end-to-end formulation that jointly trains the visual tree parser, the structured relevance and diversity constraints, and the LSTM-based captioning model. To facilitate quantitative evaluation, we further release two group captioning datasets derived from the MS-COCO benchmark, the first of their kind. Quantitative results show that the proposed GroupCap model outperforms state-of-the-art and alternative approaches, generating more accurate and discriminative captions under various evaluation metrics.
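To make the described pipeline concrete, below is a minimal PyTorch sketch of how the three components in the abstract (a tree-structured image encoder, pairwise relevance/diversity terms between images in a group, and an LSTM caption generator trained jointly with those constraints) could fit together. This is not the authors' released code: the module names (`TreeParser`, `CaptionDecoder`, `relevance_diversity`), dimensions, pooling choices, and the exact form of the constraint losses are all assumptions made for illustration.

```python
# Hedged sketch of a GroupCap-style pipeline; all names and loss forms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TreeParser(nn.Module):
    """Stand-in for the paper's visual tree parser (VP-Tree): maps per-image
    region features to a fixed set of node embeddings encoding semantic structure."""
    def __init__(self, feat_dim=2048, node_dim=512, num_nodes=8):
        super().__init__()
        self.node_proj = nn.Linear(feat_dim, node_dim * num_nodes)
        self.num_nodes, self.node_dim = num_nodes, node_dim

    def forward(self, region_feats):            # (B, R, feat_dim)
        pooled = region_feats.mean(dim=1)        # crude pooling over regions
        nodes = self.node_proj(pooled)           # (B, num_nodes * node_dim)
        return nodes.view(-1, self.num_nodes, self.node_dim)


def relevance_diversity(nodes_a, nodes_b):
    """Pairwise similarity between two images' tree nodes: node-level best matches
    act as a relevance score, overall dissimilarity as a diversity score (assumed form)."""
    sim = F.cosine_similarity(nodes_a.unsqueeze(2), nodes_b.unsqueeze(1), dim=-1)  # (B, N, N)
    relevance = sim.max(dim=-1).values.mean(dim=-1)   # average best-matching similarity
    diversity = 1.0 - sim.mean(dim=(-1, -2))          # average dissimilarity
    return relevance, diversity


class CaptionDecoder(nn.Module):
    """LSTM caption generator conditioned on the tree summary plus scalar
    relevance/diversity context (a simplification of the paper's constraints)."""
    def __init__(self, vocab_size, node_dim=512, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + node_dim + 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, nodes, rel, div, captions):     # captions: (B, T)
        ctx = torch.cat([nodes.mean(dim=1), rel.unsqueeze(-1), div.unsqueeze(-1)], dim=-1)
        words = self.embed(captions)                   # (B, T, embed_dim)
        ctx = ctx.unsqueeze(1).expand(-1, words.size(1), -1)
        hidden, _ = self.lstm(torch.cat([words, ctx], dim=-1))
        return self.out(hidden)                        # (B, T, vocab)


# Joint training step over a group of two images (random data, placeholder weights).
parser, decoder = TreeParser(), CaptionDecoder(vocab_size=1000)
feats_a, feats_b = torch.randn(4, 36, 2048), torch.randn(4, 36, 2048)
caps_a = torch.randint(0, 1000, (4, 12))

nodes_a, nodes_b = parser(feats_a), parser(feats_b)
rel, div = relevance_diversity(nodes_a, nodes_b)
logits = decoder(nodes_a, rel, div, caps_a[:, :-1])
caption_loss = F.cross_entropy(logits.reshape(-1, 1000), caps_a[:, 1:].reshape(-1))
# Constraint terms encourage both relevance and diversity alongside caption likelihood,
# mirroring the end-to-end joint training described in the abstract.
loss = caption_loss + 0.1 * (1.0 - rel).mean() + 0.1 * (1.0 - div).mean()
loss.backward()
```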
