Streaming speech recognition architectures are employed for low-latency, real-time applications.
no code implementations • 1 Mar 2023 • Feng-Ju Chang, Anastasios Alexandridis, Rupak Vignesh Swaminathan, Martin Radfar, Harish Mallidi, Maurizio Omologo, Athanasios Mouchtaris, Brian King, Roland Maas
We integrate the MC fusion networks into a conformer transducer model and train the system in an end-to-end fashion.
In particular, hidden Markov models are developed for the traffic lanes and the speed changes of vehicles on highways.
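To make the HMM formulation concrete, here is a minimal Viterbi decoding sketch over hypothetical lane-behavior states. The state names, observation symbols, and all probabilities are illustrative assumptions, not values from the paper:

```python
# Illustrative Viterbi decoding for a two-state HMM over lane behavior.
# States, observations, and probabilities are hypothetical examples.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # V[t][s] = (best probability of ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            V[t][s] = (V[t - 1][best_prev][0] * trans_p[best_prev][s]
                       * emit_p[s][obs[t]], best_prev)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        last = V[t][last][1]
        path.append(last)
    return list(reversed(path))

states = ("keep_lane", "change_lane")
start_p = {"keep_lane": 0.9, "change_lane": 0.1}
trans_p = {"keep_lane": {"keep_lane": 0.9, "change_lane": 0.1},
           "change_lane": {"keep_lane": 0.3, "change_lane": 0.7}}
# Observations: lateral-motion features quantized to low/high.
emit_p = {"keep_lane": {"low_lat": 0.95, "high_lat": 0.05},
          "change_lane": {"low_lat": 0.05, "high_lat": 0.95}}

obs = ["low_lat", "low_lat", "high_lat", "high_lat", "low_lat"]
path = viterbi(obs, states, start_p, trans_p, emit_p)
```

Decoding this observation sequence yields a lane-change segment in the middle of the trajectory; a real system would estimate the probabilities from driving data rather than fix them by hand.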
As for other forms of AI, speech recognition has recently been examined with respect to performance disparities across different user cohorts.
A popular approach is to fine-tune the model with data from regions where the ASR model has a higher word error rate (WER).
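WER, the metric used to identify those regions, is the word-level edit distance between hypothesis and reference, normalized by reference length. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / #reference words,
    computed with a rolling-array Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] = edit distance between ref[:i] and hyp[:j] for the current row i.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev = dp[0]          # dp value from the previous row, one column left
        dp[0] = i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + cost)     # substitution or match
            prev = cur
    return dp[-1] / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions relative to a short reference.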
We present a streaming, Transformer-based end-to-end automatic speech recognition (ASR) architecture which achieves efficient neural inference through compute cost amortization.
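A common mechanism behind streaming Transformer ASR is chunk-wise causal attention: each frame attends only to its own chunk and a bounded number of past chunks, keeping per-frame compute constant. The sketch below shows such a mask; it illustrates the general streaming-attention pattern, not this paper's specific amortization scheme:

```python
def chunk_causal_mask(num_frames, chunk_size, left_chunks=1):
    """Build a boolean attention mask for chunk-wise streaming attention.
    mask[i][j] is True when frame i may attend to frame j, i.e. when
    frame j lies in frame i's chunk or in the `left_chunks` chunks before it."""
    mask = []
    for i in range(num_frames):
        ci = i // chunk_size  # chunk index of the query frame
        row = []
        for j in range(num_frames):
            cj = j // chunk_size  # chunk index of the key frame
            row.append(ci - left_chunks <= cj <= ci)
        mask.append(row)
    return mask
```

With `chunk_size=2` and one left chunk, frame 4 can see frames 2-5 but not frames 0-1, bounding both latency and attention cost per emitted frame.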
However, the options for count time series are limited: Gaussian DLMs require continuous data, while Poisson-based alternatives often lack sufficient modeling flexibility.
We find that tandem training of teacher and student encoders with an inplace encoder distillation outperforms the use of a pre-trained and static teacher transducer.
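The distillation objective in such teacher-student setups is typically a KL divergence between temperature-softened output distributions. A minimal sketch of that loss, with the temperature value chosen arbitrarily for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.
    A higher temperature exposes more of the teacher's relative
    preferences among non-top classes ("dark knowledge")."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In the in-place variant described above, the teacher logits come from the jointly trained teacher encoder on the same batch, rather than from a frozen pre-trained model.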
An ASR model that operates on both primary and auxiliary data can achieve higher accuracy than a primary-only solution, and a model that can serve both primary-only (PO) and primary-plus-auxiliary (PPA) modes is highly desirable.
The end-to-end 2D Conv-Attention model is compared with multi-head self-attention and superdirective-based neural beamformers.
Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms.
Understanding how nano- or micro-scale structures and material properties can be optimally configured to attain specific functionalities remains a fundamental challenge.
Acoustic models in real-time speech recognition systems typically stack multiple unidirectional LSTM layers to process the acoustic frames over time.
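The stacking pattern can be sketched with a toy scalar LSTM: each layer consumes the previous layer's hidden states left to right, so every frame depends only on past context. The weights here are hypothetical placeholders, not trained values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One step of a 1-dimensional toy LSTM cell.
    W maps gate name -> (input weight, recurrent weight, bias)."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h + W["i"][2])   # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h + W["f"][2])   # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h + W["o"][2])   # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h + W["g"][2]) # candidate
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new

def run_stack(frames, layer_weights):
    """Stack of unidirectional LSTM layers: layer k processes layer k-1's
    hidden-state sequence strictly left to right (no future context)."""
    seq = frames
    for W in layer_weights:
        h = c = 0.0
        out = []
        for x in seq:
            h, c = lstm_step(x, h, c, W)
            out.append(h)
        seq = out
    return seq

# Hypothetical identical weights for a two-layer stack.
W = {"i": (1.0, 0.0, 0.0), "f": (1.0, 0.0, 0.0),
     "o": (1.0, 0.0, 0.0), "g": (1.0, 0.0, 0.0)}
out = run_stack([1.0, -1.0, 0.5], [W, W])
```

Real acoustic models use vector-valued gates and hundreds of units per layer, but the strictly causal, layer-by-layer data flow is the same.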