We study the challenging problem of deploying a robot in a previously unseen environment and having it follow unconstrained natural language navigation instructions.
Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e.g., 'stairs') to visual content in the environment (the pixels corresponding to 'stairs').
We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions.
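To make "low-level actions in a continuous environment" concrete, here is a minimal sketch of what such an action interface and rollout loop might look like. The `Action` set, step size, turn angle, and the `env`/`agent` interfaces are illustrative assumptions, not the task's exact specification:

```python
from enum import Enum

class Action(Enum):
    # Hypothetical low-level action set; the step size and turn angle
    # below are assumptions modeled on common continuous-navigation setups.
    FORWARD = 0      # move ahead a fixed distance, e.g. 0.25 m
    TURN_LEFT = 1    # rotate in place, e.g. 15 degrees
    TURN_RIGHT = 2
    STOP = 3         # declare the instruction complete

def follow_instruction(env, agent, instruction, max_steps=500):
    """Roll out an agent in a continuous environment until it STOPs.

    `env` and `agent` are hypothetical interfaces: env.reset()/env.step()
    return an observation (e.g. an RGB-D frame plus pose), and the agent
    maps the instruction and current observation to a low-level action.
    """
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(instruction, obs)
        if action is Action.STOP:
            break
        obs = env.step(action)
    return env.agent_position()  # success is judged by distance to the goal
```

The point of the continuous setting is visible in the loop: the agent cannot teleport between predefined viewpoints; it must chain many small forward and turn actions, and it must decide for itself when to emit STOP.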
Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks.
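As a loose illustration of the modular idea, the sketch below composes small reusable modules (a `Find` module producing an attention map and a `Count` module reading a scalar off it) according to a question-specific layout. The module set, layout, and tensor shapes are illustrative assumptions, not the exact published architecture:

```python
import torch
import torch.nn as nn

class Find(nn.Module):
    """Produces an attention map over image features for a query embedding."""
    def __init__(self, feat_dim, embed_dim):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, embed_dim, kernel_size=1)

    def forward(self, feats, word_emb):
        # feats: (B, C, H, W); word_emb: (B, E) -> attention map (B, 1, H, W)
        keys = self.proj(feats)                                   # (B, E, H, W)
        scores = (keys * word_emb[:, :, None, None]).sum(1, keepdim=True)
        return torch.sigmoid(scores)

class Count(nn.Module):
    """Maps an attention map to a scalar count estimate."""
    def forward(self, attn):
        return attn.sum(dim=(1, 2, 3))

# A question like "How many red cubes?" might be parsed into the layout
# Count(Find[red cube]); here the modules are wired by hand for clarity.
find, count = Find(feat_dim=512, embed_dim=128), Count()
feats = torch.randn(1, 512, 14, 14)   # dummy CNN feature map
word = torch.randn(1, 128)            # dummy embedding for "red cube"
answer = count(find(feats, word))
```

The appeal of the modular framework is that the same small modules can be recombined into different layouts for different questions, rather than training one monolithic network per reasoning pattern.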