Using a submodular function, a training set for automatic speech recognition that matches the target data set is selected.
A separate regression neural network is trained for each source-target language pair to transform posteriors from the source acoustic model to the target language.
This technique measures the similarity between the posterior distributions that various monolingual acoustic models produce for a target speech signal.
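As one plausible reading of this similarity measure, the sketch below averages frame-level posteriors into utterance-level distributions and scores each source model against the target with a symmetrised KL divergence; the function names and the choice of symmetrised KL are assumptions, not the paper's exact formulation.

```python
# Minimal sketch (hypothetical names): scoring how well each source-language
# acoustic model matches target speech by comparing posterior distributions.
import numpy as np

def mean_posterior(posteriors):
    """Average frame-level posteriors (T x K) into one utterance-level distribution."""
    p = np.asarray(posteriors).mean(axis=0)
    return p / p.sum()

def symmetric_kl(p, q, eps=1e-10):
    """Symmetrised KL divergence between two discrete distributions."""
    p, q = p + eps, q + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def similarity_scores(posteriors_src, posteriors_tgt):
    """posteriors_src[i]: (T x K) posteriors of the i-th monolingual model on
    the target signal; posteriors_tgt: posteriors of the target-language model.
    Higher score (less divergence) means a closer match to the target."""
    q = mean_posterior(posteriors_tgt)
    return [-symmetric_kl(mean_posterior(p), q) for p in posteriors_src]
```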
This shows positive information transfer from acted datasets to those with more natural emotions, and the benefits of training on different corpora.
This paper analyses the internal dynamics between layers during training for CNN-, LSTM- and Transformer-based approaches, using canonical correlation analysis (CCA) and centered kernel alignment (CKA) in the experiments.
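Linear CKA in particular has a compact closed form; the sketch below computes it between the activation matrices of two layers. This is the standard linear CKA formula, not anything specific to this paper's setup.

```python
# Linear centered kernel alignment (CKA) between two layers' activations.
# X and Y are (n_examples x features) matrices; 1.0 means identical geometry.
import numpy as np

def linear_cka(X, Y):
    # Centre each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return cross / (norm_x * norm_y)
```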
It is shown that this weighted multi-dilation temporal convolutional network (WD-TCN) consistently outperforms the TCN across various model configurations, and that using the WD-TCN model is a more parameter-efficient way to improve performance than increasing the number of convolutional blocks.
A feature of TCNs is that their receptive field (RF), which determines the number of input frames observed to produce an individual output frame, depends on the specific model configuration.
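For a stack of dilated 1-D convolutions the RF follows directly from the kernel sizes and dilations; the sketch below computes it, assuming one dilated convolution per block and exponentially increasing dilations (a common TCN layout, not necessarily the exact configuration used here).

```python
# Receptive field of a TCN: each convolutional block with kernel size k and
# dilation d adds (k - 1) * d frames to the one-frame base.
def tcn_receptive_field(kernel_size, dilations):
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# e.g. 8 blocks, kernel size 3, dilations 1, 2, 4, ..., 128:
rf = tcn_receptive_field(3, [2 ** b for b in range(8)])  # 1 + 2 * 255 = 511 frames
```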
It was shown recently that a combination of ASR and TTS models yields highly competitive performance on standard voice conversion tasks such as the Voice Conversion Challenge 2020 (VCC2020).
Training of speech enhancement systems often does not incorporate knowledge of human perception and thus can lead to unnatural sounding results.
In this method, multiple automatic speech recognition (ASR) 1-best hypotheses are integrated into the computation of the connectionist temporal classification (CTC) loss function.
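A minimal PyTorch sketch of this idea might weight a per-hypothesis CTC loss by normalised ASR hypothesis scores; the softmax weighting is an assumption, and the paper's actual combination rule may differ.

```python
# Minimal sketch: fold several ASR 1-best hypotheses into one CTC objective
# by weighting per-hypothesis CTC losses (weighting scheme is an assumption).
import torch
import torch.nn.functional as F

ctc = torch.nn.CTCLoss(blank=0, reduction='none', zero_infinity=True)

def multi_hypothesis_ctc(log_probs, input_lengths, hypotheses, scores):
    """log_probs: (T, 1, C) log-softmaxed network output for one utterance;
    hypotheses: list of 1-D label tensors; scores: per-hypothesis ASR scores."""
    weights = F.softmax(torch.as_tensor(scores, dtype=torch.float), dim=0)
    losses = []
    for hyp in hypotheses:
        target_lengths = torch.tensor([hyp.numel()])
        losses.append(ctc(log_probs, hyp.unsqueeze(0), input_lengths, target_lengths))
    return (weights * torch.cat(losses)).sum()
```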
The use of the memory mechanism achieves 10.6% and 7.7% relative improvements compared with not using it.
In this work, we aim at improving the data efficiency of the model and achieving a many-to-many non-parallel StarGAN-based voice conversion for a relatively large number of speakers with limited training samples.
The first task is to predict an utterance quality score, and the second is to identify where an anomalous distortion takes place in a recording.
The obtained results show that the proposed approach using speaker-dependent speech enhancement can yield better speaker recognition and speech enhancement performance than two baselines in various noise conditions.
To evaluate the effectiveness of the proposed approach, artificial datasets based on Switchboard Cellular Part 1 (SWBC) and Voxceleb1 are constructed under two conditions, with and without overlapping speakers' voices.
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
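A minimal sketch of such a joint framework follows, assuming an MSE enhancement loss, a cross-entropy speaker loss, and hypothetical module names (`enhancer`, `recogniser`) with an interpolation weight `alpha`; the actual losses and weighting are not specified here.

```python
# Minimal sketch: enhancement front-end and speaker classifier trained
# together with one combined loss, so one backward pass updates both modules.
import torch
import torch.nn.functional as F

def joint_loss(enhancer, recogniser, noisy, clean, speaker_ids, alpha=0.5):
    enhanced = enhancer(noisy)                        # speech enhancement module
    enh_loss = F.mse_loss(enhanced, clean)            # signal-level objective
    logits = recogniser(enhanced)                     # speaker recognition module
    spk_loss = F.cross_entropy(logits, speaker_ids)   # classification objective
    return alpha * enh_loss + (1 - alpha) * spk_loss
```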
In the proposed approach, a frame-level encoder and attention are applied to segments of an input utterance to generate individual segment vectors.
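A minimal sketch of this kind of frame-level attentive pooling in PyTorch, with an illustrative single-query scorer and layer sizes that are assumptions rather than the paper's configuration:

```python
# Minimal sketch: a learned scorer weights each encoded frame of a segment,
# and the weighted sum of frames gives the segment vector.
import torch
import torch.nn as nn

class SegmentAttentionPool(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # frame-level attention scorer

    def forward(self, frames):          # frames: (batch, T, dim) encoder output
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, T, 1)
        return (weights * frames).sum(dim=1)                # (batch, dim) segment vector
```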
To evaluate the effectiveness of our approaches compared to prior work, two tasks, phone classification and speaker recognition, are conducted and tested on different TIMIT data sets.
While the use of deep neural networks has significantly boosted speaker recognition performance, it is still challenging to separate speakers in poor acoustic environments.
The proposed technique for training data selection significantly outperforms random selection, posterior-based selection, and using all of the available data.
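Submodular objectives admit a simple greedy maximiser with a (1 - 1/e) approximation guarantee; the sketch below shows the generic selection loop, leaving the concrete objective `f` (e.g. a facility-location score against the target set) as an assumption.

```python
# Minimal sketch of greedy submodular selection: utterances are added one at
# a time by largest marginal gain of a monotone submodular objective f.
def greedy_select(candidates, f, budget):
    selected = []
    while len(selected) < budget and candidates:
        best = max(candidates, key=lambda u: f(selected + [u]) - f(selected))
        selected.append(best)
        candidates = [u for u in candidates if u is not best]
    return selected
```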
This paper introduces a new British English speech database, named the homeService corpus, which has been gathered as part of the homeService project.
We describe the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge task of transcribing multi-genre broadcast shows.
This paper presents a new method for the discovery of latent domains in diverse speech data, for use in the adaptation of deep neural networks (DNNs) for automatic speech recognition.
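One plausible reading of latent domain discovery is unsupervised clustering of utterance-level embeddings, with each cluster treated as a domain for DNN adaptation; the k-means/scikit-learn sketch below is an illustrative assumption, not the paper's actual method.

```python
# Minimal sketch: cluster utterance-level embeddings (e.g. i-vectors) without
# supervision and treat each cluster as a latent domain for adaptation.
import numpy as np
from sklearn.cluster import KMeans

def discover_domains(utterance_embeddings, n_domains=4):
    km = KMeans(n_clusters=n_domains, n_init=10, random_state=0)
    return km.fit_predict(np.asarray(utterance_embeddings))  # domain label per utterance
```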
The USFD primary system incorporates state-of-the-art ASR and MT techniques and gives BLEU scores of 23.45 and 14.75 on the English-to-French and English-to-German speech-to-text translation tasks with the IWSLT 2014 data.
Negative transfer in training of acoustic models for automatic speech recognition has been reported in several contexts such as domain change or speaker characteristics.
Hence it is often not evident whether data should be considered out-of-domain.