Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data.
Multi-modal models need to be robust: missing video frames should not degrade the performance of an audio-visual model below that of a single-modality, audio-only model.
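A common way to train for this kind of robustness is to randomly drop the visual stream during training so the fused model never comes to depend on it being present. Below is a minimal sketch, assuming simple feature concatenation as the fusion step; the function name and shapes are illustrative, not taken from any of the cited papers.

```python
import numpy as np

def modality_dropout(audio_feat, video_feat, p_drop_video=0.5, rng=None):
    """Randomly blank the video stream during training so the fused model
    learns to fall back to audio alone. Shapes are illustrative: (T, D) each."""
    rng = rng or np.random.default_rng()
    if rng.random() < p_drop_video:
        video_feat = np.zeros_like(video_feat)  # simulate missing video frames
    return np.concatenate([audio_feat, video_feat], axis=-1)  # simple fusion
```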
In this work, we revisit the entropy semiring for neural speech recognition models, and show how alignment entropy can be used to supervise models through regularization or distillation.
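The alignment entropy in question is the Shannon entropy of the distribution the model places over alignment paths, and the entropy (expectation) semiring lets a single forward pass compute it exactly. The toy sketch below assumes per-frame independent alignment choices rather than a full CTC or transducer lattice; the lattice structure and values are purely illustrative.

```python
import math

# Entropy (expectation) semiring elements are pairs (p, r).
# plus:  (p1, r1) + (p2, r2) = (p1 + p2, r1 + r2)
# times: (p1, r1) * (p2, r2) = (p1 * p2, p1 * r2 + p2 * r1)
def sr_plus(a, b):
    return (a[0] + b[0], a[1] + b[1])

def sr_times(a, b):
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

# Arc weight for an alignment step taken with probability p: (p, -p * log p).
def arc(p):
    return (p, -p * math.log(p))

# Toy alignment lattice: at each frame the model picks one of K symbols
# independently; a path is one symbol per frame.
frame_posteriors = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.9, 0.05, 0.05],
]

# Forward pass in the entropy semiring over all paths.
total = (1.0, 0.0)  # semiring "one"
for dist in frame_posteriors:
    frame_sum = (0.0, 0.0)  # semiring "zero"
    for p in dist:
        frame_sum = sr_plus(frame_sum, arc(p))
    total = sr_times(total, frame_sum)

Z, acc = total
entropy = acc / Z + math.log(Z)  # H = r/Z + log Z (here Z == 1)
print(f"alignment entropy: {entropy:.4f} nats")

# Sanity check: with independent frames, path entropy is the sum of
# per-frame entropies.
ref = sum(-sum(p * math.log(p) for p in d) for d in frame_posteriors)
print(f"direct computation: {ref:.4f} nats")
```

The same pair arithmetic carries over to a forward algorithm on a real CTC or transducer lattice; only the lattice topology changes.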
We achieve a new state-of-the-art of 12.8% WER for visual speech recognition on the TED LRS3 dataset, which rivals the performance of audio-only models from just four years ago.
As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model.
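One way to make speaker selection differentiable is to let the audio features act as queries that attend over the visual features of every candidate face track, producing a soft, convex combination of tracks instead of a hard choice. The sketch below illustrates that idea only; the function names and shapes are assumptions, not the cited model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_speaker(audio_feat, face_tracks):
    """Soft speaker selection: audio frames attend over candidate face tracks.

    audio_feat:  (T, D) audio features
    face_tracks: (N, T, D) visual features for N candidate faces
    Returns a (T, D) visual feature that is a convex combination of the tracks,
    so the selection stays fully differentiable.
    """
    # Scaled dot-product score between each audio frame and each face track.
    scores = np.einsum('td,ntd->tn', audio_feat, face_tracks) / np.sqrt(audio_feat.shape[-1])
    weights = softmax(scores, axis=-1)            # (T, N) attention over faces
    return np.einsum('tn,ntd->td', weights, face_tracks)
```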
Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face in the visual signal is the face matching the audio.
Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face.
We achieve state-of-the-art performance for audio-visual speech recognition on LRS3-TED after fine-tuning our model (1.6% WER).
In this work, we propose to replace the 3D convolutional visual front-end with a video transformer front-end.
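A video transformer front-end of this kind typically embeds fixed-size patches of each frame and runs a standard transformer encoder over the resulting sequence. A rough PyTorch sketch follows, with all class names and dimensions chosen for illustration rather than taken from the paper.

```python
import torch.nn as nn

class VideoTransformerFrontend(nn.Module):
    """Illustrative alternative to a 3D-conv visual front-end: each frame is cut
    into patches, linearly embedded, and encoded with a transformer."""
    def __init__(self, patch_dim=16 * 16 * 3, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches):                      # (batch, seq, patch_dim)
        return self.encoder(self.patch_embed(patches))  # (batch, seq, d_model)
```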
To improve streaming models, a recent study proposed to distill a non-streaming teacher model on unsupervised utterances, and then train a streaming student using the teacher's predictions.
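The distillation step usually amounts to matching the student's output distribution to the frozen teacher's on the unlabeled audio, for example with a temperature-scaled KL divergence. A hedged sketch is below; the exact loss used in the cited study may differ.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between the streaming student's and the non-streaming
    teacher's frame-level output distributions.
    Logits: (batch, time, vocab); names and shapes are illustrative."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits.detach() / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction='batchmean') * (t * t)
```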
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture.
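At a high level, such a system fuses the audio and visual feature streams and feeds them to a standard RNN-T: an encoder over the fused features, a prediction network over the label history, and a joint network combining the two. The skeleton below uses hypothetical dimensions and makes no claim to match the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AVTransducer(nn.Module):
    """Skeleton RNN-T over fused audio-visual features (illustrative only)."""
    def __init__(self, av_dim=768, vocab=1000, hidden=512):
        super().__init__()
        self.encoder = nn.LSTM(av_dim, hidden, num_layers=2, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                   nn.Linear(hidden, vocab))

    def forward(self, av_feats, labels):      # (B, T, av_dim), (B, U) label ids
        enc, _ = self.encoder(av_feats)                # (B, T, H)
        pred, _ = self.predictor(self.embed(labels))   # (B, U, H)
        # Broadcast encoder (time) against predictor (label) positions.
        enc = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)
        pred = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
        return self.joint(torch.cat([enc, pred], dim=-1))  # (B, T, U, vocab)
```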