StyleNet: Generating Attractive Visual Captions With Styles

CVPR 2017  ·  Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, Li Deng

We propose a novel framework named StyleNet to address the task of generating attractive captions for images and videos with different styles. To this end, we devise a novel model component, named factored LSTM, which automatically distills the style factors in the monolingual text corpus. Then at runtime, we can explicitly control the style in the caption generation process so as to produce attractive visual captions with the desired style. Our approach achieves this goal by leveraging two sets of data: 1) factual image/video-caption paired data, and 2) stylized monolingual text data (e.g., romantic and humorous sentences). We show experimentally that StyleNet outperforms existing approaches for generating visual captions with different styles, as measured by both automatic metrics and human evaluation on the newly collected FlickrStyle10K image caption dataset, which contains 10K Flickr images with corresponding humorous and romantic captions.
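
The abstract does not spell out how the factored LSTM separates style from content, so the following is only an illustrative sketch under a stated assumption: that an input weight matrix is factored into a product of three matrices, where the outer two are shared across styles and the middle one is style-specific. The class name `FactoredProjection`, the dimensions, and the random initialisation are all hypothetical, and the sketch covers a single input projection rather than a full LSTM cell. It shows how swapping one small matrix at runtime would change the output style while the shared parameters stay fixed.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def rand_matrix(rows, cols, rng):
    """Small random matrix; stands in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class FactoredProjection:
    """Assumed factorisation: W_style = U @ S_style @ V.

    U and V are shared by every style; only S differs per style.
    This is an illustration, not the authors' implementation.
    """

    def __init__(self, in_dim, factor_dim, out_dim, styles, seed=0):
        rng = random.Random(seed)
        self.U = rand_matrix(out_dim, factor_dim, rng)   # shared
        self.V = rand_matrix(factor_dim, in_dim, rng)    # shared
        # One style-specific factor matrix per style name.
        self.S = {s: rand_matrix(factor_dim, factor_dim, rng) for s in styles}

    def weight(self, style):
        """Assemble the full input weight matrix for the chosen style."""
        return matmul(matmul(self.U, self.S[style]), self.V)

    def project(self, x, style):
        """Apply the style-specific weight matrix to input vector x."""
        column = [[v] for v in x]
        return [row[0] for row in matmul(self.weight(style), column)]


proj = FactoredProjection(in_dim=4, factor_dim=3, out_dim=5,
                          styles=["factual", "romantic", "humorous"])
x = [1.0, 0.5, -0.5, 2.0]
factual = proj.project(x, "factual")
romantic = proj.project(x, "romantic")  # same U and V, different S
```

Under this assumption, the shared matrices could be trained on the factual paired data while each style-specific matrix is trained on the matching monolingual corpus, and choosing the style at runtime amounts to selecting a key in `S`.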


Datasets


Introduced in the Paper:

FlickrStyle10K

Used in the Paper:

Flickr30k
