We demonstrate EMMA, an embodied multimodal agent developed for the Alexa Prize SimBot Challenge.
Compositionality – the ability to combine simpler concepts to understand and generate arbitrarily more complex conceptual structures – has long been thought to be the cornerstone of human language capacity.
The next generation of conversational AI systems needs to: (1) process language incrementally, token by token, to be more responsive and to handle conversational phenomena such as pauses, restarts and self-corrections; (2) reason incrementally, allowing meaning to be established beyond what is said; (3) be transparent and controllable, allowing designers, as well as the system itself, to easily establish the reasons for particular behaviour and to tailor the system to particular user groups or domains.
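A toy sketch of point (1), assuming nothing beyond plain Python: an utterance is consumed token by token, a partial interpretation is updated after every token, and simple pause and self-correction cues are handled before the utterance is complete. The cue words and the state representation here are hypothetical illustrations, not the system described above.

```python
# Toy sketch: incremental, token-by-token processing with naive handling
# of a filler ("uh") and a self-correction cue ("no,").
def incremental_understanding(tokens):
    partial_state = []                      # hypothetical partial-interpretation state
    for token in tokens:
        if token == "uh":                   # pause/filler: keep the state unchanged
            continue
        if token == "no,":                  # self-correction cue: retract the last hypothesis
            if partial_state:
                partial_state.pop()
            continue
        partial_state.append(token)         # extend the interpretation with the new token
        print("partial interpretation:", " ".join(partial_state))
    return partial_state

incremental_understanding("book a table for two uh no, three".split())
# ends with the corrected interpretation: "book a table for three"
```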
Since the advent of Transformer-based, pretrained language models (LMs) such as BERT, Natural Language Understanding (NLU) components, in the form of Dialogue Act Recognition (DAR) and Slot Recognition (SR) for dialogue systems, have become both more accurate and easier to create for specific application domains.
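A minimal sketch of what such an NLU component typically looks like, assuming a BERT-style encoder from the HuggingFace `transformers` library; the joint architecture and the label counts (`n_acts`, `n_slots`) are illustrative assumptions, not a specific toolkit's implementation.

```python
# Sketch: joint dialogue-act classification (utterance level) and slot tagging
# (token level) on top of a pretrained BERT encoder.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class JointDarSr(nn.Module):
    def __init__(self, model_name="bert-base-uncased", n_acts=10, n_slots=20):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.act_head = nn.Linear(hidden, n_acts)    # dialogue act classifier
        self.slot_head = nn.Linear(hidden, n_slots)  # per-token slot tagger

    def forward(self, **inputs):
        states = self.encoder(**inputs).last_hidden_state
        act_logits = self.act_head(states[:, 0])     # [CLS] vector represents the utterance
        slot_logits = self.slot_head(states)         # one prediction per (sub)token
        return act_logits, slot_logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = JointDarSr()
batch = tokenizer("book a table for two", return_tensors="pt")
act_logits, slot_logits = model(**batch)
```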
Interactive and embodied tasks pose at least two fundamental challenges to existing Vision & Language (VL) models, including (1) grounding language in trajectories of actions and observations, and (2) referential disambiguation.
We demonstrate the usefulness of the data by training and evaluating strong baseline models for executing TPRs.
Referential ambiguities arise in dialogue when a referring expression does not uniquely identify the intended referent for the addressee.
Large language models are known to produce output that sounds fluent and convincing but is often wrong, e.g. "unfaithful" with respect to a rationale retrieved from a knowledge base.
Anaphoric expressions, such as pronouns and referential descriptions, are situated with respect to the linguistic context of prior turns as well as the immediate visual environment.
As transparency becomes key for robotics and AI, it will be necessary to evaluate the methods through which transparency is provided, including automatically generated natural language (NL) explanations.
Automatic Speech Recognition (ASR) systems are increasingly powerful and accurate, but also more numerous, with several currently available as a service (e.g. Google, IBM, and Microsoft).
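One simple way to exploit several ASR services at once is to select among their 1-best hypotheses by confidence; the sketch below illustrates this baseline only, and the dictionary format is a hypothetical stand-in, not any vendor's actual API response.

```python
# Illustrative sketch: pick the highest-confidence transcript among
# the 1-best hypotheses returned by several ASR services.
def select_hypothesis(results):
    """results: mapping from service name to (transcript, confidence)."""
    best_service = max(results, key=lambda name: results[name][1])
    return best_service, results[best_service][0]

results = {
    "service_a": ("book a table for two", 0.91),
    "service_b": ("book a table or two", 0.84),
    "service_c": ("book a table for tea", 0.77),
}
print(select_hypothesis(results))   # -> ('service_a', 'book a table for two')
```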
Dialogue technologies such as Amazon's Alexa have the potential to transform the healthcare industry.
Learning with minimal data is one of the key challenges in the development of practical, production-ready goal-oriented dialogue systems.
We have recently seen the emergence of several publicly available Natural Language Understanding (NLU) toolkits, which map user utterances to structured, but more abstract, Dialogue Act (DA) or Intent specifications, while making this process accessible to the lay developer.
To test the model's generalisation potential, we evaluate the same model on the bAbI+ dataset, without any additional training.
We present an optimised multi-modal dialogue agent for interactive learning of visually grounded word meanings from a human tutor, trained on real human-human tutoring data.
We motivate and describe a new freely available human-human dialogue dataset for interactive learning of visually grounded word meanings through ostensive definition by a tutor to a learner.
We present a multi-modal dialogue system for interactive learning of perceptually grounded word meanings from a human tutor.
Our experiments show that our model can process 74% of the Facebook AI bAbI dataset even when trained on only 0.13% of the data (5 dialogues).
Results show that the semantic accuracy of the MemN2N model drops drastically, and that, although it is in principle able to learn to process the constructions in bAbI+, it needs an impractical amount of training data to do so.
We present VOILA: an optimised, multi-modal dialogue agent for interactive learning of visually grounded word meanings from a human user.
We present a method for inducing new dialogue systems from very small amounts of unannotated dialogue data, showing how word-level exploration using Reinforcement Learning (RL), combined with an incremental and semantic grammar - Dynamic Syntax (DS) - allows systems to discover, generate, and understand many new dialogue variants.
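To make "word-level exploration using Reinforcement Learning" concrete, here is a toy illustration only: an epsilon-greedy agent explores which word to produce next, receives a reward from a stand-in user simulator, and updates a simple value table. This is not the Dynamic Syntax grammar or the induction method itself; the vocabulary, reward function, and hyperparameters are invented for illustration.

```python
# Toy sketch: epsilon-greedy exploration over candidate next words,
# with value estimates updated from a simulated reward signal.
import random

vocab = ["hello", "which", "colour", "shape", "okay"]
values = {w: 0.0 for w in vocab}      # estimated usefulness of each candidate word
epsilon, lr = 0.2, 0.1

def simulated_reward(word):
    # stand-in for a user simulator: asking about "colour" happens to succeed here
    return 1.0 if word == "colour" else 0.0

for episode in range(200):
    if random.random() < epsilon:     # explore: try an arbitrary word
        word = random.choice(vocab)
    else:                             # exploit: use the currently best-valued word
        word = max(values, key=values.get)
    reward = simulated_reward(word)
    values[word] += lr * (reward - values[word])

print(max(values, key=values.get))    # -> 'colour' after enough exploration
```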