End-to-end (E2E) multi-channel ASR systems achieve state-of-the-art performance on far-field ASR tasks by jointly training a multi-channel front-end with the ASR model.
We use a single-channel encoder for CT speech and a multi-channel encoder with Spatial Filtering neural beamforming for FT speech, which are jointly trained with the encoder selection.
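One way to realize such an encoder selection is a learned soft weighting over the two encoder outputs. The sketch below is a minimal illustration of that idea, not the paper's exact architecture; the function names, shapes, and the scalar-logit selection are all assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the encoder logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def select_encoder(h_ct, h_ft, logits):
    # Soft selection: weight the close-talk (CT) and far-talk (FT)
    # encoder outputs by a softmax over two learned scalar logits.
    w = softmax(logits)
    return w[0] * h_ct + w[1] * h_ft

rng = np.random.default_rng(0)
h_ct = rng.standard_normal((10, 4))  # (frames, features) from the CT encoder
h_ft = rng.standard_normal((10, 4))  # from the FT encoder + neural beamformer
h = select_encoder(h_ct, h_ft, np.array([2.0, -2.0]))  # logits favor CT here
```

In a real system the logits would come from a small classifier conditioned on the input, and the whole selection is differentiable, so it can be trained jointly with both encoders.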
Automatic speech-based affect recognition of individuals in dyadic conversation is a challenging task, in part because of its heavy reliance on manual pre-processing.
As a result, the Noisy Student algorithm with soft labels and consistency regularization achieves a 10.4% word error rate (WER) reduction when adding 475 h of unlabeled data, corresponding to a recovery rate of 92%.
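Recovery rate is commonly defined as the fraction of the oracle (fully supervised) WER improvement that the semi-supervised system recovers. A minimal sketch of that computation, with illustrative numbers that are not from the paper:

```python
def recovery_rate(wer_baseline, wer_semi, wer_oracle):
    # Fraction (in %) of the baseline-to-oracle WER gap closed by the
    # semi-supervised system. Assumes wer_baseline > wer_oracle.
    return 100.0 * (wer_baseline - wer_semi) / (wer_baseline - wer_oracle)

# Hypothetical WERs: recovering 0.92 points of a possible 1.0-point
# gain gives a 92% recovery rate.
rate = recovery_rate(10.0, 9.08, 9.0)
```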
Sequence-to-sequence (seq2seq) based ASR systems have shown state-of-the-art performance while having clear advantages in terms of simplicity.
Transcription of broadcast news is an interesting and challenging application for large-vocabulary continuous speech recognition (LVCSR).
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm.
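The multiplicative-update style referred to above is exemplified by the classic Lee-Seung NMF updates, which preserve non-negativity by multiplying parameters with ratios of non-negative terms instead of subtracting gradients. The sketch below shows that basic mechanism on shallow NMF; the deep-unfolded network in the text generalizes this style, and nothing here reproduces its exact update rule.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((8, 5))   # non-negative data matrix
W = rng.random((8, 3))   # non-negative basis (stays non-negative)
H = rng.random((3, 5))   # non-negative activations (stays non-negative)
eps = 1e-9               # guards against division by zero

err0 = np.linalg.norm(V - W @ H)  # initial reconstruction error
for _ in range(200):
    # Multiplicative updates for the Frobenius-norm objective:
    # each factor is scaled elementwise, so it never goes negative.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err = np.linalg.norm(V - W @ H)   # error after training
```

Because every update is an elementwise multiplication by a non-negative ratio, no projection or clipping step is needed to maintain the non-negativity constraint.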