Vision and Language Navigation
78 papers with code • 5 benchmarks • 12 datasets
Most implemented papers
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering.
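To make the setting concrete, here is a minimal sketch of the instruction-conditioned action loop that the R2R task implies: encode the natural-language instruction, then predict one navigation action per step from the current visual features. All module and variable names are illustrative placeholders, not the paper's code.

```python
# Minimal sketch of an instruction-following VLN agent (illustrative, not the paper's model).
import torch
import torch.nn as nn

class Seq2SeqVLNAgent(nn.Module):
    def __init__(self, vocab_size, hidden=512, feat_dim=2048, n_actions=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.instr_enc = nn.LSTM(hidden, hidden, batch_first=True)
        self.policy = nn.LSTMCell(feat_dim + hidden, hidden)
        self.act_head = nn.Linear(hidden, n_actions)   # e.g. forward / turn / stop

    def forward(self, instr_tokens, visual_feats):
        """instr_tokens: (B, L) word ids; visual_feats: (B, T, feat_dim) per-step features."""
        _, (h_instr, _) = self.instr_enc(self.embed(instr_tokens))
        ctx = h_instr[-1]                              # instruction summary vector
        h = torch.zeros_like(ctx)
        c = torch.zeros_like(ctx)
        logits = []
        for t in range(visual_feats.size(1)):          # one action decision per time step
            h, c = self.policy(torch.cat([visual_feats[:, t], ctx], dim=-1), (h, c))
            logits.append(self.act_head(h))
        return torch.stack(logits, dim=1)              # (B, T, n_actions)
```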
Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task.
Retouchdown: Adding Touchdown to StreetLearn as a Shareable Resource for Language Grounding Tasks in Street View
The Touchdown panoramas have been added to the StreetLearn dataset and can be obtained via the same process previously used for StreetLearn.
How Much Can CLIP Benefit Vision-and-Language Tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders that use a relatively small set of manually annotated data (compared to web-crawled data) to perceive the visual world.
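The paper's core idea is to swap the usual ImageNet-style visual encoder for CLIP. A minimal sketch of that substitution, using the Hugging Face CLIP wrappers, is shown below; the checkpoint name is one common choice and not necessarily the exact configuration evaluated in the paper.

```python
# Sketch: use CLIP image embeddings as drop-in visual features for a V&L / VLN policy.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_view_features(images: list[Image.Image]) -> torch.Tensor:
    """Encode a list of view crops into unit-normalized CLIP image embeddings."""
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)         # (N, 512) for ViT-B/32
    return feats / feats.norm(dim=-1, keepdim=True)

# These embeddings can replace the conventional ResNet features fed to the downstream policy.
```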
The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation
As deep learning continues to make progress for challenging perception tasks, there is increased interest in combining vision, language, and decision-making.
Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions.
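The key shift from graph-based VLN is the low-level action interface. The sketch below illustrates it; the 0.25 m / 15° values match the common Habitat-style configuration, and `env` and `policy` are hypothetical stand-ins rather than the paper's code.

```python
# Sketch of a continuous-environment (VLN-CE style) low-level action loop.
from enum import Enum

class LowLevelAction(Enum):
    STOP = 0
    MOVE_FORWARD = 1   # translate ~0.25 m
    TURN_LEFT = 2      # rotate ~15 degrees
    TURN_RIGHT = 3     # rotate ~15 degrees

def rollout(env, policy, instruction, max_steps=500):
    """Follow an instruction with low-level actions until STOP or the step limit."""
    obs = env.reset(instruction)
    for _ in range(max_steps):
        action = policy.act(obs, instruction)          # returns a LowLevelAction
        if action is LowLevelAction.STOP:
            break
        obs = env.step(action)
    return obs
```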
Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset.
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
The Vision-and-Language Navigation (VLN) task entails an agent following navigation instructions in photo-realistic, unknown environments.
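The paper's auxiliary progress-estimation idea can be summarized as a small head that regresses how much of the path has been completed, trained jointly with the action predictor. The sketch below is illustrative: module names, the sigmoid output range, and the loss weighting are assumptions, not the paper's exact formulation.

```python
# Sketch of an auxiliary progress monitor trained alongside the action predictor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressMonitor(nn.Module):
    def __init__(self, hidden=512):
        super().__init__()
        self.progress_head = nn.Linear(hidden, 1)

    def forward(self, agent_state):
        # Estimated fraction of the path already covered, squashed to [0, 1].
        return torch.sigmoid(self.progress_head(agent_state)).squeeze(-1)

def combined_loss(action_logits, action_targets, progress_pred, progress_targets, lam=0.5):
    """Cross-entropy on actions plus an auxiliary regression term on predicted progress."""
    action_loss = F.cross_entropy(action_logits, action_targets)
    progress_loss = F.mse_loss(progress_pred, progress_targets)
    return action_loss + lam * progress_loss
```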
Airbert: In-domain Pretraining for Vision-and-Language Navigation
Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.
Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
In this paper, we take a radical approach to bridging the gap between synthetic studies and real-world practice: we propose a novel planned-ahead hybrid reinforcement learning model that combines model-free and model-based reinforcement learning to solve a real-world vision-language navigation task.
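A hedged sketch of the planned-ahead hybrid idea: a model-free policy proposes candidate actions, and a learned environment model rolls each candidate forward a few imagined steps so the agent can pick the action with the best predicted outcome. All object names (`policy`, `env_model`, `critic`) are illustrative, and this is not the paper's implementation.

```python
# Sketch: score top-k policy proposals with short imagined rollouts from a learned model.
def plan_ahead(state, policy, env_model, critic, k_candidates=3, horizon=2):
    """Return the candidate action with the highest imagined look-ahead value."""
    candidates = policy.top_k_actions(state, k=k_candidates)
    best_action, best_value = None, float("-inf")
    for action in candidates:
        sim_state, value, a = state, 0.0, action
        for _ in range(horizon):                       # imagined look-ahead rollout
            sim_state, reward = env_model.predict(sim_state, a)
            value += reward
            a = policy.greedy_action(sim_state)
        value += critic.value(sim_state)               # bootstrap beyond the horizon
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```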