Vision-Language Navigation
31 papers with code • 1 benchmark • 7 datasets
Vision-language navigation (VLN) is the task of steering an embodied agent to carry out natural-language instructions inside real 3D environments.
(Image credit: Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout)
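Concretely, a VLN episode pairs an instruction with a sequence of navigation decisions over a scene graph of viewpoints. The sketch below is a toy illustration of that loop, not any benchmark's actual API; `ToyEnv`, `follow`, and the scene graph are all hypothetical stand-ins (a real agent would replace the pre-planned route with a learned policy).

```python
# Minimal sketch of a VLN episode loop (hypothetical interface, not a
# specific benchmark's API). The agent receives a natural-language
# instruction and, at each step, chooses among navigable viewpoints
# until it issues STOP; success means stopping at the goal.
from dataclasses import dataclass

STOP = "stop"

@dataclass
class ToyEnv:
    graph: dict      # adjacency of viewpoints in a toy scene graph
    position: str    # current viewpoint
    goal: str        # target viewpoint named by the instruction

    def actions(self):
        # Navigable neighbours plus the option to stop.
        return list(self.graph[self.position]) + [STOP]

    def step(self, action):
        done = action == STOP
        if not done:
            self.position = action
        return self.position, done

def follow(env, route):
    """Execute a pre-planned route (a stand-in for a learned policy)."""
    for action in route:
        _, done = env.step(action)
        if done:
            break
    return env.position == env.goal

# "Walk from the hall into the kitchen."
graph = {"hall": ["kitchen", "stairs"], "kitchen": ["hall"], "stairs": ["hall"]}
env = ToyEnv(graph, position="hall", goal="kitchen")
success = follow(env, ["kitchen", STOP])
```

The papers below differ mainly in how they learn the policy that replaces the hard-coded route: memory structures, pretraining, data augmentation, or adversarial training.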
Most implemented papers
Structured Scene Memory for Vision-Language Navigation
Recently, numerous algorithms have been developed to tackle the problem of vision-language navigation (VLN), i.e., requiring an agent to navigate 3D environments by following linguistic instructions.
The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation
Vision-and-Language Navigation (VLN) requires an agent to find a path to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information
One key challenge in this task is to ground instructions with the current visual information that the agent perceives.
Vision-Language Navigation with Random Environmental Mixup
Then, we cross-connect the key views of different scenes to construct augmented scenes.
Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation
Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator to move to the wrong target by destroying the most instructive information in instructions at different timesteps.
Contrastive Instruction-Trajectory Learning for Vision-Language Navigation
The vision-language navigation (VLN) task requires an agent to reach a target under the guidance of a natural-language instruction.
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility
To study VLN with unknown command feasibility, we introduce a new dataset, Mobile app Tasks with Iterative Feedback (MoTIF), where the goal is to complete a natural-language command in a mobile app.
Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration
To improve the ability of fast cross-domain adaptation, we propose Prompt-based Environmental Self-exploration (ProbES), which can self-explore environments by sampling trajectories and automatically generate structured instructions via a large-scale cross-modal pretrained model (CLIP).
Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation
Since the rise of vision-language navigation (VLN), great progress has been made in instruction following -- building a follower to navigate environments under the guidance of instructions.
Reinforced Structured State-Evolution for Vision-Language Navigation
However, the crucial navigation clues (i.e., object-level environment layout) for the embodied navigation task are discarded, since the maintained vector is essentially unstructured.