With recent developments in cross-lingual Text-to-Speech (TTS) systems, problems with L2 (second-language, or foreign) accent arise.
Two proposed modules are added to the end-to-end TTS framework: an intonation predictor and an intonation encoder.
In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoChA) jointly trained with enhancement layers.
Research in speaker recognition has recently seen significant progress due to the application of neural network models and the availability of new large-scale datasets.
The experimental results show that the use of duration and score fusion yields a 5% relative improvement in LRiMLC15 cost for language recognition.
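The score fusion mentioned above can be illustrated with a minimal sketch: a weighted linear combination of per-language scores from several subsystems. The weights, score values, and function name below are hypothetical, chosen for illustration only; they are not taken from the paper.

```python
def fuse_scores(score_lists, weights):
    """Linearly combine per-language scores from several subsystems.

    score_lists: one list of scores per subsystem (same length each),
    weights: one fusion weight per subsystem.
    """
    assert len(score_lists) == len(weights)
    n_languages = len(score_lists[0])
    fused = [0.0] * n_languages
    for scores, w in zip(score_lists, weights):
        for i, s in enumerate(scores):
            fused[i] += w * s
    return fused

# Hypothetical scores for three target languages from two subsystems.
acoustic_scores = [1.2, -0.3, 0.5]
duration_scores = [0.4, 0.1, -0.2]

fused = fuse_scores([acoustic_scores, duration_scores], [0.7, 0.3])
# Decision: pick the language with the highest fused score.
best_language = max(range(len(fused)), key=lambda i: fused[i])
```

In practice the fusion weights are typically calibrated on a development set rather than fixed by hand as in this sketch.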