In this work, we propose Exformer, a time-domain architecture for target speaker extraction.
In real life, room reverberation (the room effect) and background noise degrade the quality of speech.
As deep speech enhancement algorithms have recently demonstrated capabilities greatly surpassing their traditional counterparts for suppressing noise, reverberation and echo, attention is turning to the problem of packet loss concealment (PLC).
Singing voice separation aims to separate music into vocals and accompaniment components.
Neural vocoders have recently demonstrated high-quality speech synthesis, but typically at a high computational complexity.
Neural speech synthesis models can synthesize high-quality speech, but typically require high computational complexity to do so.
Automatic speech recognition (ASR) in the cloud allows the use of larger models and more powerful multi-channel signal processing front-ends compared to on-device processing.
The presence of multiple talkers in the surrounding environment poses a difficult challenge for real-time speech communication systems considering the constraints on network size and complexity.
Given a limited set of labeled data, we present a method to leverage a large volume of unlabeled data to improve the model's performance.
Audio codecs based on discretized neural autoencoders have recently been developed and shown to provide significantly higher compression levels for comparable quality speech output.
Neural network applications generally benefit from larger models, but current larger-scale speech enhancement networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data.
We demonstrate that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP and that uncompressed LPCNet can exceed the quality of a waveform codec operating at low bitrate.
We demonstrate that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS.
Despite noise suppression being a mature area of signal processing, it remains highly dependent on fine-tuning of estimator algorithms and parameters.