We propose a 2D-only method that maps multiple context views and a query pose to a new image in a single pass of a neural network.
Our model substantially outperforms the baseline on the MultiWOZ data and shows competitive performance with state of the art in both automatic and human evaluation.
Ranked #3 on End-To-End Dialogue Modelling on MULTIWOZ 2.0 (using extra training data)
This precludes the use of the learned policy on a real robot.
However, the application of deep RL to visual navigation with realistic environments is a challenging task.