Federated Learning (FL) is a privacy-preserving paradigm that allows edge devices to learn collaboratively without sharing data.
Low-footprint automatic speech recognition (ASR) models are increasingly being deployed on edge devices for conversational agents, which enhances privacy.
By fine-tuning an ASR model on synthetic stuttered speech, we reduce the word error rate by 5.7% relative on stuttered utterances, with only minor (<0.2% relative) degradation for fluent utterances.
Contrastive Predictive Coding (CPC) is a representation learning method that maximizes the mutual information between intermediate latent representations and the output of a given model.
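To make the idea concrete, below is a minimal sketch of an InfoNCE-style contrastive loss of the kind CPC builds on, where a context vector is trained to pick out its true future latent from in-batch negatives; the function and tensor names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(context, future_latents, prediction_head):
    """InfoNCE: score each true future latent against in-batch negatives.

    context:         (batch, dim) context vectors c_t
    future_latents:  (batch, dim) target latents z_{t+k}
    prediction_head: linear map from context space to latent space (assumed)
    """
    predictions = prediction_head(context)          # (batch, dim)
    # Similarity of every prediction with every latent; the diagonal
    # holds the positive pairs, off-diagonal entries act as negatives.
    logits = predictions @ future_latents.t()       # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Minimizing this cross-entropy maximizes a lower bound on the
    # mutual information between context and future representations.
    return F.cross_entropy(logits, targets)
```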
19 Jul 2022 • Gopinath Chennupati, Milind Rao, Gurpreet Chadha, Aaron Eakin, Anirudh Raju, Gautam Tiwari, Anit Kumar Sahu, Ariya Rastrow, Jasha Droppo, Andy Oberlin, Buddha Nandanoor, Prahalad Venkataramanan, Zheng Wu, Pankaj Sitpure
For end-to-end automatic speech recognition (ASR) tasks, the absence of human-annotated labels, together with the need for privacy-preserving policies, makes model building a daunting challenge.
A popular approach is to fine-tune the model with data from regions where the ASR model has a higher word error rate (WER).
Results show that the pairwise weighting method can achieve 1.08% overall EER, 1.25% for male and 0.67% for female speakers, with relative EER reductions of 7.7%, 10.1% and 3.0%, respectively.
This is observed especially with underrepresented demographic groups sharing similar voice characteristics.
Our observations show that SynthASR holds great promise in training the state-of-the-art large-scale E2E ASR models for new applications while reducing the costs and dependency on production data.
We find that tandem training of teacher and student encoders with an in-place encoder distillation outperforms the use of a pre-trained and static teacher transducer.
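As a hedged sketch of what such tandem training could look like, the snippet below updates teacher and student encoders in the same step and distills the student against the teacher's current (rather than frozen, pre-trained) outputs; the MSE distillation term and all names are assumptions for illustration.

```python
import torch.nn.functional as F

def tandem_step(teacher_encoder, student_encoder, features):
    """One training step of tandem teacher/student encoding (sketch)."""
    teacher_out = teacher_encoder(features)
    student_out = student_encoder(features)
    # In-place distillation: match the teacher's evolving representations.
    # detach() stops the distillation term from updating the teacher.
    distill_loss = F.mse_loss(student_out, teacher_out.detach())
    return teacher_out, student_out, distill_loss
```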
This paper proposes a new neural text-to-speech model that approaches the disentanglement problem by conditioning a Tacotron2-like architecture on flow-normalized speaker embeddings, and by substituting the reference encoder with a new learned latent distribution responsible for modeling the intra-sentence variability due to the prosody.
An ASR model that operates on both primary and auxiliary data can achieve better accuracy compared to a primary-only solution; and a model that can serve both primary-only (PO) and primary-plus-auxiliary (PPA) modes is highly desirable.
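One plausible way to realize a single dual-mode network is to substitute a learned placeholder for the auxiliary stream whenever it is absent, so the same parameters serve PO and PPA requests; the sketch below assumes this design, which may differ from the paper's actual approach.

```python
import torch
import torch.nn as nn

class DualModeEncoder(nn.Module):
    """Sketch: one encoder serving PO and PPA modes (names assumed)."""

    def __init__(self, feat_dim, aux_dim, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim + aux_dim, hidden_dim)
        # Learned stand-in for the auxiliary stream in primary-only mode.
        self.aux_placeholder = nn.Parameter(torch.zeros(aux_dim))

    def forward(self, primary, auxiliary=None):
        if auxiliary is None:  # PO mode: no auxiliary data available
            auxiliary = self.aux_placeholder.expand(
                primary.size(0), primary.size(1), -1)
        return torch.relu(self.proj(torch.cat([primary, auxiliary], dim=-1)))
```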
On the other hand, a streaming system using per-frame intent posteriors as extra inputs for the RNN-T ASR system yields a 3.33% relative WERR.
The end-to-end 2D Conv-Attention model is compared with multi-head self-attention and superdirective-based neural beamformers.
However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features to the wav2vec 2.0 network from the quantized representations in a way similar to a VQ-VAE model.
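A rough sketch of such a consistency regularizer, assuming a small feed-forward decoder that reconstructs the input features from the quantized codes and contributes an L2 penalty to the training objective, in the spirit of a VQ-VAE; module and dimension names are illustrative.

```python
import torch.nn as nn

class ConsistencyRegularizer(nn.Module):
    """Reconstruct input features from quantized codes (VQ-VAE style)."""

    def __init__(self, code_dim, feature_dim, hidden_dim=512):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, feature_dim),
        )

    def forward(self, quantized, input_features):
        # quantized:      (batch, time, code_dim) output of the quantizer
        # input_features: (batch, time, feature_dim) original inputs
        reconstruction = self.decoder(quantized)
        # L2 reconstruction error: the consistency penalty added to the
        # main wav2vec 2.0 objective during training.
        return ((reconstruction - input_features) ** 2).mean()
```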
Spoken language understanding (SLU) systems extract transcriptions, as well as semantics of intent or named entities, from speech, and are essential components of voice-activated systems.
29 Dec 2020 • Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek
This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS).
Unlike previous work on this topic, which performs on-the-fly limited-size beam-search decoding and generates alignment scores for expected edit-distance computation, our proposed method re-calculates and sums the scores of all possible alignments for each hypothesis in the N-best list.
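To make the full-sum idea concrete, the sketch below sums the scores of all alignments of one hypothesis with a CTC-style forward recursion in log space; it assumes a CTC topology and a non-empty hypothesis, purely for illustration, and is not the paper's implementation.

```python
import torch

def log_sum_alignment_score(log_probs, hypothesis, blank=0):
    """Log-sum of the scores of ALL alignments of `hypothesis`.

    log_probs:  (time, vocab) frame-level log posteriors
    hypothesis: non-empty list of label ids (no blanks)
    """
    # Interleave blanks: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for y in hypothesis:
        ext += [y, blank]
    S, T = len(ext), log_probs.size(0)

    alpha = torch.full((S,), float("-inf"))
    alpha[0] = log_probs[0, ext[0]]
    alpha[1] = log_probs[0, ext[1]]

    for t in range(1, T):
        new_alpha = torch.full((S,), float("-inf"))
        for s in range(S):
            paths = [alpha[s]]                      # stay in the same state
            if s > 0:
                paths.append(alpha[s - 1])          # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                paths.append(alpha[s - 2])          # skip over a blank
            new_alpha[s] = torch.logsumexp(torch.stack(paths), 0) \
                + log_probs[t, ext[s]]
        alpha = new_alpha

    # Valid alignments end in the final label or the final blank.
    return torch.logsumexp(torch.stack([alpha[-1], alpha[-2]]), 0)
```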
However, this type of word-based CTC model suffers from the out-of-vocabulary (OOV) issue, as it can only model a limited number of words in the output layer and maps all remaining words to a single OOV output node.
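The collapse is easy to state in code: every word outside the fixed output vocabulary maps to the same node, so the model cannot distinguish among them. A toy illustration with assumed names:

```python
vocab = {"play": 1, "music": 2, "please": 3}  # fixed word output layer
OOV_ID = 0

def output_id(word):
    # Any word missing from the output layer collapses onto the OOV node.
    return vocab.get(word.lower(), OOV_ID)

print([output_id(w) for w in ["Play", "Beethoven", "music"]])  # [1, 0, 2]
```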
In this paper we approach this question by comparing the output of our most accurate CTS recognition system to that of a standard speech transcription vendor pipeline.
We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion.
We propose to train a bi-directional neural network language model (NNLM) with noise contrastive estimation (NCE).
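A minimal sketch of the NCE objective, which trains the model as a binary classifier between true words and samples from a noise distribution and thereby avoids the full-vocabulary softmax that makes NNLM training expensive; tensor names and shapes are illustrative assumptions.

```python
import torch

def nce_loss(target_scores, noise_scores, target_log_q, noise_log_q, k):
    """NCE for an unnormalized language model.

    target_scores: (batch,)    unnormalized log-scores s(w) of true words
    noise_scores:  (batch, k)  unnormalized log-scores of k noise words
    target_log_q:  (batch,)    log q(w) of the true words (noise dist.)
    noise_log_q:   (batch, k)  log q(w) of the noise words
    """
    log_k = torch.log(torch.tensor(float(k)))
    # log P(D=1 | w) = s(w) - logaddexp(s(w), log k + log q(w))
    pos = target_scores - torch.logaddexp(target_scores, log_k + target_log_q)
    # log P(D=0 | w) for each of the k noise samples
    neg = (log_k + noise_log_q) - torch.logaddexp(noise_scores,
                                                  log_k + noise_log_q)
    return -(pos + neg.sum(dim=1)).mean()
```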