We study the problem of synthesizing immersive 3D indoor scenes from one or more images.
We study the automatic generation of navigation instructions from 360-degree images captured along indoor routes.
Speech-based image retrieval has been studied as a proxy for joint representation learning, usually with little emphasis on the retrieval task itself.
By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress in representation learning.
Conventional spoken language understanding (SLU) systems consist of two main components: an automatic speech recognition (ASR) module that converts audio to a transcript, and a natural language understanding (NLU) module that transforms the resulting text (or top-N hypotheses) into a set of domains, intents, and arguments.
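As a minimal sketch of this cascaded two-stage design (the `asr` and `nlu` functions below are hypothetical stand-ins for a trained recognizer and a trained understanding model, not any real system's API):

```python
# Toy illustration of a conventional SLU cascade: audio -> ASR -> NLU.
# Both stages here are hypothetical placeholders for trained models.
from dataclasses import dataclass, field
from typing import List


@dataclass
class NLUResult:
    domain: str
    intent: str
    arguments: dict = field(default_factory=dict)


def asr(audio: bytes) -> List[str]:
    """Stage 1 (hypothetical): decode audio into ranked transcript hypotheses."""
    # A real ASR module would decode the waveform; we return a fixed example.
    return ["play jazz in the kitchen"]


def nlu(hypotheses: List[str]) -> NLUResult:
    """Stage 2 (hypothetical): map text to a domain, intent, and arguments."""
    text = hypotheses[0]  # use the 1-best hypothesis for simplicity
    if text.startswith("play"):
        return NLUResult(domain="music", intent="PlayMusic",
                         arguments={"genre": "jazz", "location": "kitchen"})
    return NLUResult(domain="unknown", intent="Unknown")


def slu_pipeline(audio: bytes) -> NLUResult:
    """Run the full cascade: the NLU stage sees only the ASR output text."""
    return nlu(asr(audio))


if __name__ == "__main__":
    print(slu_pipeline(b"\x00"))
    # NLUResult(domain='music', intent='PlayMusic',
    #           arguments={'genre': 'jazz', 'location': 'kitchen'})
```

A design consequence of this cascade, visible in the sketch, is that the NLU stage depends entirely on the ASR transcript, so recognition errors propagate downstream; passing the top-N hypotheses rather than only the 1-best is one common mitigation.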