For on-device automatic speech recognition (ASR), quantization-aware training (QAT) is widely used to balance model predictive performance against efficiency.
Recently, as an alternative to LSTM layers, the Conformer architecture was introduced, in which the RNN-T encoder is replaced by a modified Transformer encoder with convolutional layers at the frontend and between the attention layers.
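To make the block structure concrete, here is a minimal PyTorch sketch of a single Conformer-style block; it simplifies the convolution module (the full design adds pointwise convolutions, GLU, and batch normalization), and the class name, dimensions, and kernel size are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Sketch of one Conformer block: two half-step feed-forward modules
    sandwiching self-attention and a depthwise-convolution module, each
    wrapped in a residual connection."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, kernel: int = 15):
        super().__init__()
        self.ff1 = self._feed_forward(d_model)
        self.ff2 = self._feed_forward(d_model)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        # Depthwise convolution captures local context between attention layers.
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel // 2, groups=d_model)
        self.out_norm = nn.LayerNorm(d_model)

    @staticmethod
    def _feed_forward(d_model: int) -> nn.Sequential:
        return nn.Sequential(nn.LayerNorm(d_model),
                             nn.Linear(d_model, 4 * d_model),
                             nn.SiLU(),
                             nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (batch, time, dim)
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)             # (batch, dim, time)
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.out_norm(x)
```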
We present a streaming, Transformer-based end-to-end automatic speech recognition (ASR) architecture which achieves efficient neural inference through compute cost amortization.
We present a novel sub-8-bit quantization-aware training (S8BQAT) scheme for 8-bit neural network accelerators.
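As a point of reference, generic QAT inserts "fake quantization" into the forward pass while gradients bypass the rounding via a straight-through estimator. The sketch below shows only that core step; it is not the S8BQAT scheme itself, and the function name and per-tensor affine scheme are assumptions:

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate integer quantization in the forward pass while the
    backward pass sees the identity (straight-through estimator)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Per-tensor affine parameters derived from the observed range.
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x.min() / scale) + qmin
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    dequant = (q - zero_point) * scale
    # Forward value is the quantized-dequantized tensor; the gradient
    # flows through `x` unchanged because the correction is detached.
    return x + (dequant - x).detach()
```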
Personal rare word recognition in end-to-end Automatic Speech Recognition (E2E ASR) models is a challenge due to the lack of training data.
Dialogue act classification (DAC) is a critical task for spoken language understanding in dialogue systems.
We also leverage both BLSTM and pretrained BERT-based models to encode contextual data and guide the network training.
In this paper, we introduce FANS, a new end-to-end SLU model that fuses an ASR audio encoder to a multi-task NLU decoder to infer the intent, slot tags, and slot values directly from a given input audio, obviating the need for transcription.
In this paper, we present a novel speech recognition model, Multi-Channel Transformer Transducer (MCTT), which features end-to-end multi-channel training, low computation cost, and low latency, making it suitable for streaming decoding in on-device speech recognition.
End-to-end (E2E) spoken language understanding (SLU) systems predict utterance semantics directly from speech using a single model.
We find that tandem training of teacher and student encoders with an in-place encoder distillation outperforms the use of a pre-trained and static teacher transducer.
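A minimal sketch of what such tandem training with in-place encoder distillation could look like; the function names, the MSE distillation term, and the loss weighting are illustrative assumptions rather than the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def tandem_step(teacher_enc, student_enc, feats, task_loss_fn, alpha=0.5):
    """One joint update: both encoders train on the task loss, and the
    student additionally matches the teacher's per-frame outputs."""
    t_out = teacher_enc(feats)                 # (batch, time, dim)
    s_out = student_enc(feats)
    # Detach the teacher in the distillation term so the student does
    # not pull the teacher toward itself.
    distill = F.mse_loss(s_out, t_out.detach())
    return task_loss_fn(t_out) + task_loss_fn(s_out) + alpha * distill
```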
We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM) with experiments spanning over 3000 hours of GPU time, making our study one of the largest of its kind.
In the literature, such methods are referred to as sparse pruning.
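For concreteness, unstructured magnitude pruning, one common form of sparse pruning, can be sketched as below; the helper name and the convention of re-applying the mask after each optimizer step are assumptions:

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that zeroes the smallest-magnitude entries so
    roughly `sparsity` of the tensor becomes zero."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    # Keep only entries strictly above the k-th smallest magnitude;
    # re-apply this mask to the weight after every optimizer step.
    return (weight.abs() > threshold).to(weight.dtype)
```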
Transformers are powerful neural architectures that integrate different modalities using attention mechanisms.
In this paper, we treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities, and use a cross-modal latent space (CMLS) architecture, where a shared latent space is learned between the 'acoustic' and 'text' embeddings.
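One plausible way to learn such a shared latent space is a contrastive hinge loss over cosine similarities between paired acoustic and text embeddings, sketched below; the actual CMLS loss may differ, and every name here is illustrative:

```python
import torch
import torch.nn.functional as F

def cmls_hinge_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                    margin: float = 0.2) -> torch.Tensor:
    """Pull paired acoustic/text embeddings together in the shared space
    and push each row's hardest mismatched pair at least `margin` away."""
    a = F.normalize(audio_emb, dim=-1)         # (batch, dim)
    t = F.normalize(text_emb, dim=-1)
    sim = a @ t.T                              # pairwise cosine similarity
    pos = sim.diag()                           # matched audio-text pairs
    # Mask the diagonal before taking each row's hardest negative.
    neg = (sim - 1e9 * torch.eye(len(sim), device=sim.device)).max(dim=1).values
    return F.relu(margin + neg - pos).mean()
```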
In this paper, we introduce an end-to-end neural Transformer-based SLU model that predicts the variable-length domain, intent, and slot vectors embedded in an audio signal, with no intermediate token-prediction architecture.
We perform experiments in which we vary the semantic complexity of a large, proprietary dataset and show that STI model performance correlates with our semantic complexity measures: performance increases as complexity decreases.
Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies.
Moreover, transmission introduces delays and other types of distortion that affect the perceived quality.