Measurement of interaction quality is a critical task for the improvement of spoken dialog systems.
We conclude that this improvement in ASC performance comes from the regularization effect of using AET and not from the network's improved ability to discern between acoustic events.
Standard acoustic event classification (AEC) solutions require large-scale collection of data from client devices for model optimization.
Ensuring model robustness or resilience in the skill routing component is an important problem, since skills may dynamically change their subscriptions in the ontology after the skill routing model has been deployed to production.
Speech emotion recognition (SER) is a key technology to enable more natural human-machine communication.
Natural Language Understanding (NLU) is an established component within a conversational AI or digital assistant system, and it is responsible for producing semantic understanding of a user request.
Wake word (WW) spotting is challenging in far-field conditions, not only because of interference in signal transmission but also because of the complexity of acoustic environments.
Dialogue-level quality estimation is vital for optimizing data-driven dialogue management.
Speech-based virtual assistants, such as Amazon Alexa, Google Assistant, and Apple Siri, typically convert users' audio signals to text data through automatic speech recognition (ASR) and feed the text to downstream dialog models for natural language understanding and response generation.
To address these gaps, we created a new Response Quality annotation scheme, introduced five new domain-independent feature sets, and experimented with six machine learning models to estimate User Satisfaction at both the turn and dialogue levels.
Training dialog policies for speech-based virtual assistants requires a plethora of conversational data.
An automated metric to evaluate dialogue quality is vital for optimizing data-driven dialogue management.
Acoustic Event Detection (AED), which aims to detect categories of events from audio signals, has found application in many intelligent systems.
In this paper, we present a compression approach based on the combination of low-rank matrix factorization and quantization training, to reduce complexity for neural network based acoustic event detection (AED) models.
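The two ingredients this compression approach combines can be sketched in a few lines of NumPy: a weight matrix is replaced by a truncated-SVD factorization, and each factor is passed through a "fake quantization" step of the kind used in quantization training. This is a minimal illustrative sketch, not the paper's implementation; the function names and the 256x256 toy matrix are assumptions for the example.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as U @ V with U (m x r) and V (r x n), via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]

def fake_quantize(W, num_bits=8):
    """Quantization-training 'fake quant': round onto a symmetric integer grid, then rescale."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.round(W / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
U, V = low_rank_factorize(W, rank=32)
W_hat = fake_quantize(U) @ fake_quantize(V)
# Storage drops from 256*256 weights to 2*256*32 (a 4x reduction at rank 32),
# and each factor can additionally be stored at reduced bit width.
```

In a real AED model the same substitution would be applied layer by layer, with the quantization step simulated during training so the network can adapt to the reduced precision.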
This paper presents our work on training acoustic event detection (AED) models using an unlabeled dataset.
Typical spoken language understanding systems provide narrow semantic parses using a domain-specific ontology.
We investigate low-bit quantization to reduce computational cost of deep neural network (DNN) based keyword spotting (KWS).
We explore active learning (AL) for improving the accuracy of new domains in a natural language understanding (NLU) system.
An ideal re-ranker will exhibit the following two properties: (a) it should prefer the most relevant hypothesis for the given input as the top hypothesis, and (b) the interpretation scores corresponding to each hypothesis produced by the re-ranker should be calibrated.
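Property (b), calibration, means a hypothesis scored 0.8 should be correct about 80% of the time. One common way to check this (an illustrative choice here, not necessarily the paper's metric) is expected calibration error, sketched below:

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """Average |empirical accuracy - mean confidence| over confidence bins,
    weighted by the fraction of hypotheses falling in each bin."""
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.sum() / n * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Well calibrated: 8 of 10 hypotheses scored 0.8 are actually correct -> ECE = 0.
conf = np.full(10, 0.8)
hits = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=float)
```

A re-ranker whose scores are systematically too high (say, 0.9 confidence with 50% accuracy) would show a large ECE even if its top-1 ranking is perfect.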
In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants.
This paper introduces a meaning representation for spoken language understanding.
Fast expansion of natural language functionality of intelligent virtual agents is critical for achieving engaging and informative interactions.
Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, yielding a $67.6\%$ relative reduction in the Area Under the Curve (AUC) measure compared to the baseline feed-forward DNN.
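The max-pooling loss idea can be sketched as follows: a positive (keyword) clip is charged only for its single best-scoring frame, so the network is free to fire on just one frame of the keyword, while a negative clip is penalized on every frame. This is a simplified NumPy sketch with assumed names; the paper's version operates on LSTM posteriors with cross-entropy pre-training.

```python
import numpy as np

def max_pooling_loss(frame_logits, is_keyword):
    """Max-pooling loss over per-frame sigmoid scores of one audio clip."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(frame_logits, dtype=float)))
    if is_keyword:
        # Positive clip: only the highest-scoring frame contributes.
        return -np.log(probs.max() + 1e-12)
    # Negative clip: every frame should score low.
    return -np.log(1.0 - probs + 1e-12).mean()

# One confident frame is enough for a positive clip...
low = max_pooling_loss([-5.0, -5.0, 6.0, -5.0], is_keyword=True)
# ...while a positive clip with no confident frame incurs a large loss.
high = max_pooling_loss([-5.0, -5.0, -5.0, -5.0], is_keyword=True)
```

Compared to frame-level cross-entropy, this removes the need to decide exactly which frames inside the keyword region should be labeled positive.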