22 papers with code • 1 benchmark • 3 datasets
Though impressive results have been achieved in visual captioning, the task of generating abstract stories from photo streams remains largely untapped.
The task of multi-image-cued story generation, exemplified by the Visual Storytelling dataset (VIST) challenge, is to compose multiple coherent sentences from a given sequence of images.
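As a concrete illustration of the task's input/output structure, a VIST-style example pairs an ordered photo stream (five images) with one sentence per image. The sketch below is a hypothetical example; the file names, story text, and dictionary layout are assumptions for illustration, not the dataset's official format.

```python
# Minimal illustration of the VIST task's input/output structure.
# File names and story text are hypothetical, not from the dataset.
story_example = {
    "image_sequence": [            # ordered photo stream (VIST uses 5 images)
        "album_01/img_001.jpg",
        "album_01/img_002.jpg",
        "album_01/img_003.jpg",
        "album_01/img_004.jpg",
        "album_01/img_005.jpg",
    ],
    "story": [                     # one coherent sentence per image
        "We arrived at the lake early in the morning.",
        "The kids could not wait to get in the water.",
        "Dad fired up the grill for lunch.",
        "Everyone gathered to watch the sunset.",
        "It was the perfect end to a great day.",
    ],
}

# A storytelling model maps the image sequence to the sentence sequence:
# f(image_sequence) -> story
```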
We present a neural model for generating short stories from image sequences, which extends the image description model of Vinyals et al. (2015).
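To make the extension concrete, here is a minimal PyTorch sketch of this family of models: a CNN encodes each image, a recurrent layer summarizes the ordered photo stream, and an LSTM decodes the story token by token. The architecture choices and dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class StorySeq2Seq(nn.Module):
    """Sketch of an image-sequence-to-story model, in the spirit of
    extending a CNN+RNN captioner (Vinyals et al., 2015) to photo
    streams. Details here are assumptions, not the authors' model."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = resnet18(weights=None)   # swap in pretrained weights in practice
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier head
        self.img_proj = nn.Linear(512, hidden_dim)
        # Sequence encoder: summarizes the ordered photo stream.
        self.seq_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Decoder: generates the story token by token.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, T, 3, 224, 224); captions: (B, L) token ids
        B, T = images.shape[:2]
        feats = self.encoder(images.flatten(0, 1)).flatten(1)  # (B*T, 512)
        feats = self.img_proj(feats).view(B, T, -1)            # (B, T, H)
        _, h = self.seq_rnn(feats)                             # h: (1, B, H)
        emb = self.embed(captions)                             # (B, L, E)
        dec_out, _ = self.decoder(emb, (h, torch.zeros_like(h)))
        return self.out(dec_out)                               # (B, L, V) logits
```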
Visual storytelling and story comprehension are uniquely human skills that play a central role in how we learn about and experience the world.
The visual storytelling (VST) task aims to generate a reasonable and coherent paragraph-level story from an input image stream.
To solve this problem, we propose a method that mines cross-modal rules to help the model infer informative concepts from a given visual input.
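The paper's exact mining procedure isn't reproduced here, but the general idea can be sketched as association-rule mining between detected visual concepts and story-side concepts. Everything below, including the toy data and the confidence threshold, is a hypothetical illustration.

```python
from collections import Counter, defaultdict

# Toy training pairs: visual concepts detected in an image vs. concepts
# mentioned in its ground-truth story sentence (hypothetical data).
pairs = [
    ({"cake", "candles"}, {"birthday", "party"}),
    ({"cake", "balloons"}, {"birthday", "celebrate"}),
    ({"snow", "skis"}, {"winter", "mountain"}),
]

# Mine rules "visual concept -> story concept" whose conditional
# probability (confidence) exceeds a threshold.
pair_counts, visual_counts = Counter(), Counter()
for visual, textual in pairs:
    for v in visual:
        visual_counts[v] += 1
        for t in textual:
            pair_counts[(v, t)] += 1

rules = defaultdict(list)
MIN_CONFIDENCE = 0.6   # assumed threshold, for illustration only
for (v, t), c in pair_counts.items():
    if c / visual_counts[v] >= MIN_CONFIDENCE:
        rules[v].append(t)

print(rules["cake"])   # ['birthday'] -- concept inferred for "cake"
```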
Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE, and CIDEr.
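For reference, BLEU, one of the metrics those approaches optimize, can be computed with NLTK as shown below; ROUGE and CIDEr have analogous reference-based implementations in other packages. The sentences are made up for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["we watched the sunset over the lake".split()]
candidate = "we watched a sunset by the lake".split()

# Smoothing avoids zero scores when a higher-order n-gram has no match.
smooth = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```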