Though impressive results have been achieved in visual captioning, generating abstract stories from photo streams remains a largely under-explored problem.
The task of multi-image cued story generation, exemplified by the Visual Storytelling (VIST) challenge, is to compose multiple coherent sentences from a given sequence of images.
We present a neural model for generating short stories from image sequences, which extends the image description model of Vinyals et al. (2015).
The visual storytelling (VST) task aims to generate a reasonable and coherent paragraph-level story from an input image stream.
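The pipeline these descriptions share can be sketched in miniature: encode each image in the stream to a feature vector, then decode one sentence per image while conditioning on the features seen so far. The sketch below is a hypothetical toy in NumPy under assumed names and dimensions (`encode_image`, `tell_story`, an 8-dimensional feature space, a six-word vocabulary); it is not the actual VIST baseline or any published model, only an illustration of the encode-then-decode structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image, W_enc):
    """Stand-in for a CNN encoder: flatten the image and project it."""
    return np.tanh(W_enc @ image.ravel())

def decode_sentence(context, vocab, W_dec, max_len=5):
    """Greedy stand-in decoder: pick the highest-scoring word each step."""
    words, state = [], context
    for _ in range(max_len):
        scores = W_dec @ state
        idx = int(np.argmax(scores))
        words.append(vocab[idx])
        # fold the chosen word back into the state (toy recurrence)
        state = np.tanh(state + W_dec[idx])
    return " ".join(words)

def tell_story(image_stream, vocab, W_enc, W_dec):
    features = [encode_image(img, W_enc) for img in image_stream]
    story = []
    for i in range(len(features)):
        # condition each sentence on the running mean of features so far,
        # a crude proxy for coherence across the stream
        context = np.mean(features[: i + 1], axis=0)
        story.append(decode_sentence(context, vocab, W_dec))
    return story

# Toy data: three 4x4 "images" and an assumed six-word vocabulary.
vocab = ["the", "friends", "beach", "sunset", "smiled", "arrived"]
d = 8
W_enc = rng.normal(size=(d, 16))
W_dec = rng.normal(size=(len(vocab), d))
stream = [rng.normal(size=(4, 4)) for _ in range(3)]
story = tell_story(stream, vocab, W_enc, W_dec)
print(len(story))  # one sentence per image in the stream
```

Real systems replace the encoder with a pretrained CNN and the decoder with a recurrent or attention-based language model, but the stream-level conditioning is what distinguishes storytelling from per-image captioning.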