Most prior methods for learning navigation policies require access to simulation environments, as they need online policy interaction and rely on ground-truth maps for rewards.
In this paper, we focus on the LED task, providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices.
Localizing moments in untrimmed videos via language queries is a challenging task that requires accurately grounding language in video.
We describe a novel cross-modal embedding space for actions, named Action2Vec, which combines linguistic cues from class labels with spatio-temporal features derived from video clips.
Automatic generation of textual video descriptions that are time-aligned with video content is a long-standing goal in computer vision.