The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips.
Although traditionally used in the machine translation field, the encoder-decoder framework has recently been applied to generating video and image descriptions.
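To make the framework concrete, here is a minimal sketch of an encoder-decoder for video description, assuming PyTorch and pre-extracted per-frame features; the dimensions, vocabulary size, and teacher-forced decoding are illustrative choices, not any particular paper's specification.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=256):
        super().__init__()
        # Encoder: summarize the frame-feature sequence into a hidden state.
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Decoder: generate the description one token at a time.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim); captions: (batch, seq_len)
        _, h = self.encoder(frame_feats)   # final hidden state summarizes the clip
        dec_in = self.embed(captions)      # teacher forcing with gold tokens
        dec_out, _ = self.decoder(dec_in, h)
        return self.out(dec_out)           # (batch, seq_len, vocab_size) logits

model = VideoCaptioner()
feats = torch.randn(4, 30, 2048)           # 4 clips, 30 frames each
caps = torch.randint(0, 10000, (4, 12))    # 12-token reference captions
print(model(feats, caps).shape)            # torch.Size([4, 12, 10000])
```

The same structure carries over from machine translation: only the encoder input changes, from a sequence of source-language tokens to a sequence of frame features.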
This paper strives to find, among a set of sentences, the one that best describes the content of a given image or video.
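A common way to cast this retrieval task is to project the video and each candidate sentence into a shared embedding space and rank candidates by cosine similarity; the sketch below assumes PyTorch, and the two linear projections and feature dimensions are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

video_proj = torch.nn.Linear(2048, 512)   # video feature -> joint space
sent_proj = torch.nn.Linear(768, 512)     # sentence feature -> joint space

video_feat = torch.randn(1, 2048)         # one clip (e.g., pooled frame features)
sent_feats = torch.randn(20, 768)         # 20 candidate sentence embeddings

v = F.normalize(video_proj(video_feat), dim=-1)
s = F.normalize(sent_proj(sent_feats), dim=-1)
scores = (s @ v.t()).squeeze(1)           # cosine similarity per candidate
best = scores.argmax().item()             # index of the best-matching sentence
print(best, scores[best].item())
```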
We introduce a new dataset of dialogs about videos of human behaviors.
Scene-aware dialog systems will be able to have conversations with users about the objects and events around them.
We propose a novel methodology that exploits information from temporally neighboring events, a choice that precisely matches the sequential nature of egocentric footage.
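One simple way to realize this idea, sketched below under assumed PyTorch tooling, is to score each event from its own features concatenated with those of its adjacent events; the one-neighbor window, dimensions, and linear classifier are illustrative assumptions rather than the proposed method itself.

```python
import torch
import torch.nn as nn

n_events, feat_dim, n_classes = 8, 512, 10
event_feats = torch.randn(n_events, feat_dim)  # one egocentric sequence

# Replicate the boundary events so every event has a left and right neighbor.
padded = torch.cat([event_feats[:1], event_feats, event_feats[-1:]], dim=0)
# Concatenate (previous, current, next) features for each event.
context = torch.cat([padded[:-2], padded[1:-1], padded[2:]], dim=1)

classifier = nn.Linear(3 * feat_dim, n_classes)
logits = classifier(context)                   # (n_events, n_classes)
print(logits.shape)
```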