53 papers with code • 5 benchmarks • 13 datasets
Visual Navigation is the problem of navigating an agent, e.g. a mobile robot, in an environment using camera input only. The agent is given a target image (an image it will see from the target position), and its goal is to move from its current position to the target by applying a sequence of actions, based on the camera observations only.
We propose to using high level semantic and contextual features including segmentation and detection masks obtained by off-the-shelf state-of-the-art vision as observations and use deep network to learn the navigation policy.
Second, we investigate the sim2real predictivity of Habitat-Sim for PointGoal navigation.
Learning a new task often requires both exploring to gather task-relevant information and exploiting this information to solve the task.
An Open Source and Open Hardware Deep Learning-powered Visual Navigation Engine for Autonomous Nano-UAVs
Nano-size unmanned aerial vehicles (UAVs), with few centimeters of diameter and sub-10 Watts of total power budget, have so far been considered incapable of running sophisticated visual-based autonomous navigation software without external aid from base-stations, ad-hoc local positioning infrastructure, and powerful external computation servers.
As part of our general methodology we discuss the software mapping techniques that enable the state-of-the-art deep convolutional neural network presented in  to be fully executed on-board within a strict 6 fps real-time constraint with no compromise in terms of flight results, while all processing is done with only 64 mW on average.
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering.
Ranked #3 on Visual Navigation on R2R
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e. g., a phone ringing in another room).