In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!.
The DOAs are fed to a fusion center, concatenated, and used to localize the source with two proposed methods that require only a few labeled source locations (anchor points) for training.
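The anchor-point idea above can be illustrated with a generic sketch: concatenate the DOAs observed by several arrays into one feature vector and interpolate among the nearest labeled anchors. This is a stand-in illustration, not the paper's two (unspecified) methods; the arrays, anchor grid, and the inverse-distance weighting are all assumptions.

```python
import numpy as np

def localize_from_doas(doa_features, anchor_feats, anchor_locs, k=3):
    """Estimate a source location from concatenated DOA features by
    interpolating among the k nearest labeled anchor points.
    A generic sketch, not the paper's exact methods."""
    d = np.linalg.norm(anchor_feats - doa_features, axis=1)
    idx = np.argsort(d)[:k]                    # k closest anchors in DOA space
    w = 1.0 / (d[idx] + 1e-8)                  # inverse-distance weights
    return (w[:, None] * anchor_locs[idx]).sum(0) / w.sum()

# toy setup: two arrays (hypothetical positions) observe azimuths to the source
arrays = np.array([[0.0, 0.0], [4.0, 0.0]])

def feats(p):
    """Concatenated azimuth DOAs from every array to point p."""
    return np.array([np.arctan2(p[1] - a[1], p[0] - a[0]) for a in arrays])

anchor_locs = np.array([[1.0, 1.0], [3.0, 1.0], [2.0, 2.0],
                        [1.0, 3.0], [3.0, 3.0]])   # few labeled anchors
anchor_feats = np.array([feats(p) for p in anchor_locs])

src = np.array([2.0, 1.5])
est = localize_from_doas(feats(src), anchor_feats, anchor_locs)
print(np.round(est, 2))
```

With only five anchors the interpolated estimate already lands close to the true source, which is the appeal of the anchor-point formulation.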
The solution presented in this paper is to train a neural network on pairs of microphones with different spacing and acoustic environmental conditions, and then use this network to estimate a time-frequency mask from all the pairs of microphones forming the array with an arbitrary shape.
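The pairwise strategy can be sketched as follows: apply one mask estimator to every pair of microphones and average the pairwise masks, so no fixed array geometry is assumed. The `pair_mask` function here is a hypothetical stand-in (a phase-coherence proxy) for the trained network described in the paper.

```python
import numpy as np
from itertools import combinations

def pair_mask(stft_a, stft_b):
    """Hypothetical stand-in for the trained pairwise network:
    a phase-coherence proxy in [0, 1] that is large when the two
    channels add constructively (coherent signal) and small otherwise."""
    return np.abs(stft_a + stft_b) / (np.abs(stft_a) + np.abs(stft_b) + 1e-8)

def array_mask(stfts):
    """Average the pairwise masks over every microphone pair,
    so the array may have an arbitrary shape and size."""
    masks = [pair_mask(stfts[i], stfts[j])
             for i, j in combinations(range(len(stfts)), 2)]
    return np.mean(masks, axis=0)

# toy example: 4 microphones, STFTs of 5 frames x 3 frequency bins
rng = np.random.default_rng(0)
stfts = rng.standard_normal((4, 5, 3)) + 1j * rng.standard_normal((4, 5, 3))
mask = array_mask(stfts)
print(mask.shape)  # one mask for the whole array
```

Because the estimator only ever sees one pair at a time, the same network transfers across arrays with different numbers of microphones and spacings.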
For sound source tracking, this paper presents a modified 3D Kalman (M3K) method capable of simultaneously tracking the directions of multiple sound sources in 3D.
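A minimal sketch of direction tracking with a Kalman filter is shown below: a constant-velocity filter over a 3D direction vector, renormalized to the unit sphere after each update. This is a simplified illustration of the general idea, not the paper's exact M3K formulation.

```python
import numpy as np

class DirectionKalman3D:
    """Constant-velocity Kalman filter over a 3D direction vector,
    projected back onto the unit sphere after each update.
    A simplified sketch, not the exact M3K method."""

    def __init__(self, q=1e-3, r=1e-2):
        self.x = np.zeros(6)                  # [direction, direction velocity]
        self.P = np.eye(6)
        self.F = np.eye(6)
        self.F[:3, 3:] = np.eye(3)            # constant-velocity dynamics
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])
        self.Q = q * np.eye(6)                # process noise
        self.R = r * np.eye(3)                # DOA measurement noise

    def step(self, z):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # update with the observed DOA unit vector z
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        # project the direction estimate back onto the unit sphere
        n = np.linalg.norm(self.x[:3])
        if n > 0:
            self.x[:3] /= n
        return self.x[:3].copy()

kf = DirectionKalman3D()
z = np.array([0.0, 0.0, 1.0])                 # repeated DOA observation
for _ in range(20):
    est = kf.step(z)
print(np.round(est, 2))                       # converges toward [0, 0, 1]
```

Keeping the state on the unit sphere is what lets a standard linear Kalman filter track directions rather than free 3D positions; one such filter per source gives simultaneous multi-source tracking.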
Speech recognizers trained on close-talking speech do not generalize to distant speech and the word error rate degradation can be as large as 40% absolute.