Automatic evaluation of text generation tasks (e.g., machine translation, text summarization, image captioning, and video description) usually relies heavily on task-specific metrics such as BLEU and ROUGE.
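For concreteness, the sketch below shows how such metrics are typically computed on a single hypothesis-reference pair. It assumes the NLTK and `rouge-score` packages are available, and the example sentences are invented for illustration.

```python
# Minimal sketch of metric-based evaluation; the sentences are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "a man is playing a guitar on stage"
candidate = "a man plays a guitar"

# BLEU compares candidate n-grams against one or more tokenized references;
# smoothing avoids zero scores when higher-order n-grams never match.
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")

# ROUGE-L measures longest-common-subsequence overlap between the strings.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```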
In this context, we propose an approach that takes into account both the local and global temporal structure of videos to produce descriptions.
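One common way to model global temporal structure is a soft attention mechanism that re-weights per-frame features at each decoding step. The sketch below illustrates this idea in NumPy with randomly generated features and projection weights; it is a generic temporal-attention sketch, not the exact formulation of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 20, 512, 256                 # frames, feature dim, decoder state dim

frames = rng.standard_normal((T, D))   # per-frame (local) features
h = rng.standard_normal(H)             # current decoder hidden state
W_f = rng.standard_normal((D, H))      # projection of frame features
W_h = rng.standard_normal((H, H))      # projection of decoder state
w = rng.standard_normal(H)             # scoring vector

# Relevance score of each frame given the current decoder state.
scores = np.tanh(frames @ W_f + h @ W_h) @ w   # shape (T,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                           # softmax over time

# Context vector: attention-weighted sum over all frames, letting the
# decoder focus on different temporal segments for each generated word.
context = alpha @ frames                       # shape (D,)
print(context.shape, alpha.round(3))
```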
The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips.
This paper strives to find, among a set of sentences, the one that best describes the content of a given image or video.
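Cast as retrieval, this amounts to embedding the video and each candidate sentence in a joint space and ranking candidates by similarity. The sketch below uses random placeholder embeddings and invented candidate names, since the actual encoders depend on the chosen model.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 128                                      # joint embedding dimension

video_emb = rng.standard_normal(D)           # placeholder video embedding
sentence_embs = rng.standard_normal((5, D))  # 5 candidate sentence embeddings
sentences = [f"candidate sentence {i}" for i in range(5)]

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank candidates by cosine similarity to the video embedding
# and select the best-matching description.
scores = [cosine(video_emb, s) for s in sentence_embs]
best = int(np.argmax(scores))
print(sentences[best], round(scores[best], 3))
```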
Scene-aware dialog systems will be able to converse with users about the objects and events around them.
Among the main issues are the fluency and coherence of the generated descriptions and their relevance to the video.
We propose a novel methodology that exploits information from temporally neighboring events, which precisely matches the nature of egocentric sequences.
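Purely as an illustration of exploiting temporally neighboring events (not the exact method proposed here), one simple instantiation conditions each event's caption decoder on features pooled from the preceding and following clips:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 8, 256                          # events in the sequence, feature dim
events = rng.standard_normal((N, D))   # one feature vector per egocentric event

def contextualize(events, i):
    """Augment event i with mean-pooled features of its temporal neighbors."""
    prev = events[i - 1] if i > 0 else np.zeros_like(events[i])
    nxt = events[i + 1] if i < len(events) - 1 else np.zeros_like(events[i])
    neighbor_ctx = (prev + nxt) / 2.0
    # Concatenated representation that a captioning decoder would consume.
    return np.concatenate([events[i], neighbor_ctx])

print(contextualize(events, 3).shape)  # (2 * D,)
```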