The Vision-and-Language Navigation (VLN) task entails an agent following
navigational instructions in unknown photo-realistic environments. This
challenging task demands that the agent be aware of which instruction was
completed, which instruction is needed next, which way to go, and its
navigation progress towards the goal. In this paper, we introduce a
self-monitoring agent with two complementary components: (1) a visual-textual
co-grounding module to locate the instruction completed in the past, the
instruction required for the next action, and the next moving direction from
surrounding images, and (2) a progress monitor to ensure the grounded instruction
correctly reflects the navigation progress. We test our self-monitoring agent
on a standard benchmark and analyze our proposed approach through a series of
ablation studies that elucidate the contributions of the primary components.
Using our proposed method, we set a new state of the art by a significant
margin (8% absolute increase in success rate on the unseen test set). Code is
available at https://github.com/chihyaoma/selfmonitoring-agent .