Vision and Language Navigation
104 papers with code • 5 benchmarks • 13 datasets
Most implemented papers
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering.
Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task.
Retouchdown: Adding Touchdown to StreetLearn as a Shareable Resource for Language Grounding Tasks in Street View
The Touchdown panoramas have been added to the StreetLearn dataset and can be obtained via the same process used previously for StreetLearn.
Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions.
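Below is a minimal sketch of the low-level action interface such a continuous-environment agent operates over (forward step, left/right turn, stop). The step and turn magnitudes follow the VLN-CE setup, but the `agent` and `env` objects are illustrative placeholders, not the paper's code or the Habitat API.

```python
from enum import Enum

class LowLevelAction(Enum):
    """Discrete low-level actions in the continuous-environment setting
    (VLN-CE uses ~0.25 m forward steps and ~15-degree turns)."""
    STOP = 0
    MOVE_FORWARD = 1   # advance ~0.25 m
    TURN_LEFT = 2      # rotate ~15 degrees left
    TURN_RIGHT = 3     # rotate ~15 degrees right

def rollout(agent, env, instruction, max_steps=500):
    """Generic episode loop: the agent maps (instruction, observation)
    to low-level actions until it predicts STOP or the step budget runs out."""
    obs = env.reset(instruction)
    for _ in range(max_steps):
        action = agent.act(instruction, obs)  # policy picks the next low-level action
        if action == LowLevelAction.STOP:
            break
        obs = env.step(action)
    return env.current_position()
```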
How Much Can CLIP Benefit Vision-and-Language Tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.
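As a rough illustration of the idea, the sketch below uses a frozen CLIP image encoder (via Hugging Face `transformers`) as a drop-in visual backbone that produces one feature vector per panoramic view. The checkpoint name and the single-vector-per-view pooling are assumptions for the example, not the paper's exact pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP visual encoder used in place of an ImageNet/region-feature backbone.
# "openai/clip-vit-base-patch32" is one public checkpoint, chosen for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_views(image_paths):
    """Return one unit-normalized CLIP feature vector per view image."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)        # shape: (num_views, 512)
    return feats / feats.norm(dim=-1, keepdim=True)
```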
The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation
As deep learning continues to make progress for challenging perception tasks, there is increased interest in combining vision, language, and decision-making.
Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset.
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic, previously unseen environments.
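The sketch below conveys the auxiliary progress-estimation idea in PyTorch: next to the action logits, a small head regresses a scalar progress signal in [0, 1], trained jointly with the navigation loss. Layer sizes and the loss weight are illustrative choices; the paper's actual monitor additionally conditions on attention over the instruction.

```python
import torch
import torch.nn as nn

class PolicyWithProgressMonitor(nn.Module):
    """Navigation policy with an auxiliary progress-estimation head.

    The progress head regresses how much of the instruction/path has been
    completed; hidden sizes here are illustrative, not the paper's config.
    """
    def __init__(self, hidden_dim=512, num_actions=6):
        super().__init__()
        self.action_head = nn.Linear(hidden_dim, num_actions)
        self.progress_head = nn.Sequential(
            nn.Linear(hidden_dim, 128), nn.Tanh(), nn.Linear(128, 1), nn.Sigmoid()
        )

    def forward(self, h):
        # h: agent hidden state, shape (batch, hidden_dim)
        return self.action_head(h), self.progress_head(h).squeeze(-1)

def joint_loss(action_logits, progress_pred, action_target, progress_target, lam=0.5):
    """Cross-entropy navigation loss plus a weighted MSE progress loss."""
    nav = nn.functional.cross_entropy(action_logits, action_target)
    mon = nn.functional.mse_loss(progress_pred, progress_target)
    return nav + lam * mon
```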
Airbert: In-domain Pretraining for Vision-and-Language Navigation
Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.
NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
Trained on data at unprecedented scale, large language models (LLMs) such as ChatGPT and GPT-4 exhibit emergent reasoning abilities arising from model scaling.
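As a hedged sketch of the explicit-reasoning loop, the example below serializes the instruction, navigation history, and textual descriptions of candidate directions into a prompt, asks the LLM to reason step by step, and parses the chosen candidate from its final line. The `call_llm` function and the prompt template are hypothetical placeholders, not NavGPT's actual prompts or interface.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., a chat-completion request);
    not NavGPT's actual interface."""
    raise NotImplementedError

def navgpt_style_step(instruction, history, candidate_views):
    """Serialize the VLN state into text, ask the LLM to reason explicitly,
    and parse the chosen candidate index (or stop) from its final line."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"History: {'; '.join(history) or 'none'}\n"
        "Candidate directions:\n"
        + "\n".join(f"  [{i}] {desc}" for i, desc in enumerate(candidate_views))
        + "\nThink step by step about which candidate follows the instruction, "
          "then answer on the last line as 'Action: <index>' or 'Action: stop'."
    )
    reply = call_llm(prompt)
    last = reply.strip().splitlines()[-1].lower()
    return None if "stop" in last else int(last.split(":")[-1].strip())
```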