An ASR model that operates on both primary and auxiliary data can achieve better accuracy than a primary-only solution, so a model that can serve both primary-only (PO) and primary-plus-auxiliary (PPA) modes is highly desirable.
The end-to-end 2D Conv-Attention model is compared with multi-head self-attention and superdirective-based neural beamformers.
Accent mismatch is a critical problem for end-to-end ASR.
In this work we introduce a semi-supervised approach to the voice conversion problem, in which speech from a source speaker is converted into the speech of a target speaker.
CPU branch prediction has hit a wall: existing techniques achieve near-perfect accuracy on 99% of static branches, yet the mispredictions that remain hide major performance gains.
We present a rapid design methodology that combines automated hyper-parameter tuning with semi-supervised training to build highly accurate and robust models for voice command classification.
Due to the use of a single encoder, our method generalizes to converting the voices of speakers unseen during training into those of speakers in the training set.
We present a Cycle-GAN based many-to-many voice conversion method that can convert between speakers that are not in the training set.