Modern web content - news articles, blog posts, educational resources, marketing brochures - is predominantly multimodal.
Common image-text joint understanding techniques presume that images and the associated text can universally be characterized by a single implicit model.
Visual storytelling and story comprehension are uniquely human skills that play a central role in how we learn about and experience the world.
Given GitHub's crucial role, there is a need to better understand and model its dynamics as a social platform.
We propose an end-to-end network for the visual illustration of a sequence of sentences forming a story.