Using the generated text with JAT and TTS for spoken semantic parsing improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains, respectively.
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse directly from speech have recently shown increasing promise.
no code implementations • 15 Nov 2022 • Derek Xu, Shuyan Dong, Changhan Wang, Suyoun Kim, Zhaojiang Lin, Akshat Shrivastava, Shang-Wen Li, Liang-Hsuan Tseng, Alexei Baevski, Guan-Ting Lin, Hung-Yi Lee, Yizhou Sun, Wei Wang
Recent studies find that existing self-supervised speech encoders capture primarily acoustic rather than semantic information.
In this work, we present our Joint Audio/Text (JAT) training method for the Transformer Rescorer, which leverages unpaired text-only data that is considerably cheaper to obtain than paired audio-text data.
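A minimal sketch of the joint audio/text idea: paired batches feed both audio and text into the rescorer, while text-only batches skip the audio branch, so the same parameters are trained on far more text. All module names, dimensions, and the mean-pooled audio conditioning below are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class JointRescorer(nn.Module):
    """Toy rescorer trained on both paired audio-text and unpaired text batches."""
    def __init__(self, vocab=1000, dim=256, audio_dim=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tokens, audio=None):
        x = self.embed(tokens)
        if audio is not None:  # paired batch: condition on pooled audio features
            x = x + self.audio_proj(audio).mean(dim=1, keepdim=True)
        return self.lm_head(self.encoder(x))  # token logits used for rescoring

model = JointRescorer()
paired = model(torch.randint(0, 1000, (4, 12)), audio=torch.randn(4, 50, 80))
text_only = model(torch.randint(0, 1000, (4, 12)))  # unpaired text-only batch
```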
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU), where a streaming automatic speech recognition (ASR) model produces the first-pass hypothesis and a second-pass natural language understanding (NLU) component generates the semantic parse by conditioning on both ASR's text and audio embeddings.
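As a rough illustration of the deliberation idea, the sketch below lets a second-pass decoder cross-attend over a memory built from both the first-pass text embeddings and the audio embeddings. The dimensions, layer counts, and concatenation-based fusion are assumptions made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DeliberationFusion(nn.Module):
    """Second-pass NLU decoder conditioned on ASR text and audio embeddings."""
    def __init__(self, text_dim=512, audio_dim=512, model_dim=512, vocab=1000):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, model_dim)
        self.audio_proj = nn.Linear(audio_dim, model_dim)
        layer = nn.TransformerDecoderLayer(d_model=model_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.embed = nn.Embedding(vocab, model_dim)
        self.out = nn.Linear(model_dim, vocab)

    def forward(self, parse_tokens, text_emb, audio_emb):
        # Concatenate projected text and audio embeddings along the time axis
        # and let the decoder cross-attend over the joint memory.
        memory = torch.cat([self.text_proj(text_emb), self.audio_proj(audio_emb)], dim=1)
        hidden = self.decoder(self.embed(parse_tokens), memory)
        return self.out(hidden)  # logits over the semantic-parse vocabulary

m = DeliberationFusion()
logits = m(torch.randint(0, 1000, (2, 10)), torch.randn(2, 20, 512), torch.randn(2, 50, 512))
```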
Measuring automatic speech recognition (ASR) system quality is critical for creating user-satisfying voice-driven applications.
no code implementations • 5 Apr 2021 • Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer
How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area.
We define SemDist as the distance between a reference and hypothesis pair in a sentence-level embedding space.
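In other words, a lower SemDist means the hypothesis preserves the reference's meaning even when the surface words differ. Below is a minimal sketch, assuming an off-the-shelf sentence-transformers encoder and cosine distance; the paper's embedding model and exact distance choice may differ.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Hypothetical encoder choice; the paper's sentence-embedding model may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

def semdist(reference: str, hypothesis: str) -> float:
    """SemDist as cosine distance between sentence-level embeddings."""
    ref_emb, hyp_emb = model.encode([reference, hypothesis])
    cos = np.dot(ref_emb, hyp_emb) / (np.linalg.norm(ref_emb) * np.linalg.norm(hyp_emb))
    return 1.0 - float(cos)

# A recognition error that preserves meaning scores far lower than one that does not:
print(semdist("turn the lights off", "turn the light off"))
print(semdist("turn the lights off", "what time is it"))
```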
End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results, outperforming conventional hybrid speech recognizers.
Ranked #14 on Speech Recognition on LibriSpeech test-clean
Recurrent Neural Network Transducer (RNN-T), like most end-to-end speech recognition model architectures, has an implicit neural network language model (NNLM) and cannot easily leverage unpaired text data during training.
We present an end-to-end speech recognition model that learns the interaction between two speakers based on turn-changing information.
We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings.
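One plausible reading of the gating mechanism: learn a per-stream gate that weights the conversational-context, word, and speech embeddings before summing them. The dimensions and exact gate parameterization below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GatedContextFusion(nn.Module):
    """Gated combination of conversational-context, word, and speech embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)  # one scalar gate per embedding stream

    def forward(self, context_emb, word_emb, speech_emb):
        stacked = torch.stack([context_emb, word_emb, speech_emb], dim=-2)  # (..., 3, dim)
        gates = torch.sigmoid(self.gate(torch.cat([context_emb, word_emb, speech_emb], dim=-1)))
        return (gates.unsqueeze(-1) * stacked).sum(dim=-2)  # gated sum of the three streams

fusion = GatedContextFusion()
fused = fusion(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
```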
Existing speech recognition systems are typically built at the sentence level, although it is known that dialog context, e.g., higher-level knowledge that spans sentences or speakers, can help the processing of long conversations.
Building speech recognizers in multiple languages typically involves replicating a monolingual training recipe for each language, or utilizing a multi-task learning approach where models for different languages have separate output labels but share some internal parameters.
Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training.
Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments.
We propose a novel deep neural network architecture for speech recognition that explicitly employs knowledge of the background environmental noise within a deep neural network acoustic model.
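A common way to make noise knowledge explicit, in the spirit of noise-aware training, is to concatenate a per-utterance noise estimate with every input frame. The sketch below follows that recipe with assumed dimensions and is not necessarily the paper's exact architecture.

```python
import torch
import torch.nn as nn

class NoiseAwareAcousticModel(nn.Module):
    """DNN acoustic model that sees an explicit noise estimate alongside speech features."""
    def __init__(self, feat_dim=40, noise_dim=40, hidden=512, n_states=3000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_states),
        )

    def forward(self, feats, noise_estimate):
        # The noise estimate is concatenated with each frame so the network
        # can condition its predictions on the acoustic environment.
        return self.net(torch.cat([feats, noise_estimate], dim=-1))

feats = torch.randn(8, 200, 40)  # (batch, frames, features)
noise = feats[:, :10].mean(dim=1, keepdim=True).expand_as(feats)  # crude estimate from leading frames
state_logits = NoiseAwareAcousticModel()(feats, noise)
```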
Integrating data from multiple microphones is one of the key ways to achieve robust speech recognition in noisy environments or when the speaker is located at a distance from the input device.
We propose a transfer deep learning (TDL) framework that can transfer the knowledge obtained from a single-modal neural network to a network with a different modality.
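A minimal sketch of cross-modal weight transfer: train a network on the source modality, then copy its upper layers into a target-modality network whose front-end is re-initialized for the new input. The layer sizes and the choice of which layers to share are assumptions, not the paper's prescription.

```python
import torch.nn as nn

# Hypothetical source (audio) and target (e.g., lip-video) networks that share
# upper layers; only the modality-specific front-ends differ in input size.
def make_net(in_dim):
    return nn.Sequential(
        nn.Linear(in_dim, 512), nn.ReLU(),  # modality-specific front-end
        nn.Linear(512, 512), nn.ReLU(),     # shared representation layers
        nn.Linear(512, 40),                 # shared classifier
    )

audio_net = make_net(in_dim=120)   # assume already trained on audio features
visual_net = make_net(in_dim=300)  # new modality with a different input dimension

# Transfer: copy the upper (shared) layers; the front-end stays randomly
# initialized so it can adapt to the new modality during fine-tuning.
visual_net[2].load_state_dict(audio_net[2].state_dict())
visual_net[4].load_state_dict(audio_net[4].state_dict())
```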