Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. Traditionally, audio-visual ASR has been studied under the assumption that the speaking face in the video is the face matching the audio; in realistic settings, however, multiple candidate faces may be on screen, and the system must also decide which face to pair with the audio track. Rather than treating this speaker selection as a separate step, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking speaker selection directly into a fully differentiable model.
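To make the attention-based selection concrete, the sketch below shows one way such a mechanism could look. It is an illustrative PyTorch example under assumed shapes and module names (e.g. `SpeakerSelectionAttention`), not the architecture from the cited work: the audio encoding forms a query, each candidate face track forms a key, and a softmax over the tracks yields a weighted combination of the video features, so the selection stays differentiable end to end.

```python
import torch
import torch.nn as nn

class SpeakerSelectionAttention(nn.Module):
    """Soft attention over candidate face tracks (illustrative sketch only).

    The audio encoding acts as the query; each candidate face track provides a
    key. Because selection is a softmax-weighted sum rather than a hard
    argmax, the whole model remains differentiable.
    """

    def __init__(self, audio_dim: int, video_dim: int, attn_dim: int = 256):
        super().__init__()
        self.query_proj = nn.Linear(audio_dim, attn_dim)
        self.key_proj = nn.Linear(video_dim, attn_dim)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, time, audio_dim), pooled over time for the query
        # video_feats: (batch, num_faces, time, video_dim)
        query = self.query_proj(audio_feats.mean(dim=1))              # (B, A)
        keys = self.key_proj(video_feats.mean(dim=2))                 # (B, N, A)
        scores = torch.einsum("ba,bna->bn", query, keys)              # (B, N)
        weights = torch.softmax(scores / keys.shape[-1] ** 0.5, -1)   # (B, N)
        # Weighted sum over candidate tracks -> one video stream per utterance.
        attended = torch.einsum("bn,bntd->btd", weights, video_feats)
        return attended, weights


# Example: 4 candidate faces on screen, 80-dim audio, 512-dim video features.
audio = torch.randn(2, 120, 80)
video = torch.randn(2, 4, 120, 512)
attended, weights = SpeakerSelectionAttention(80, 512)(audio, video)
print(attended.shape, weights.shape)  # (2, 120, 512) and (2, 4)
```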
In this work, we propose to replace the 3D convolutional visual front-end with a video transformer front-end.
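As an illustration of what such a front-end can look like, the sketch below follows a generic ViT-style recipe: the clip is cut into non-overlapping space-time patches, each patch is linearly embedded, and a standard transformer encoder mixes the tokens. Patch sizes, model dimensions, and the spatial pooling at the end are placeholder assumptions, not the configuration used in this work.

```python
import torch
import torch.nn as nn

class VideoTransformerFrontEnd(nn.Module):
    """Illustrative video-transformer visual front-end (sizes are placeholders)."""

    def __init__(self, patch=(2, 16, 16), dim=256, layers=6, heads=4):
        super().__init__()
        # Strided Conv3d is equivalent to flattening each space-time patch and
        # applying a Linear layer; it only serves as the patch embedding here.
        self.embed = nn.Conv3d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, 1024, dim))  # up to 1024 tokens
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, clip):
        # clip: (batch, 3, frames, height, width), e.g. a 64x64 mouth crop.
        x = self.embed(clip)                         # (B, dim, T', H', W')
        b, d, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, T'*H'*W', dim)
        tokens = tokens + self.pos[:, : tokens.shape[1]]
        tokens = self.encoder(tokens)
        # Pool the spatial tokens within each time step to obtain one visual
        # feature vector per (downsampled) frame for fusion with the audio.
        return tokens.reshape(b, t, h * w, d).mean(dim=2)   # (B, T', dim)


frontend = VideoTransformerFrontEnd()
feats = frontend(torch.randn(2, 3, 16, 64, 64))
print(feats.shape)  # torch.Size([2, 8, 256])
```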
To improve streaming models, a recent study proposed distilling a non-streaming teacher into a streaming student: the teacher is run on unsupervised utterances, and the student is then trained on the teacher's predictions.
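The sketch below outlines that teacher-student recipe as described, with hypothetical `teacher.transcribe` and `student(audio, targets)` interfaces standing in for whatever models are actually used: the frozen non-streaming teacher decodes unlabeled audio with full context, and the streaming student is optimized on the resulting pseudo-transcripts.

```python
import torch

def distill_streaming_student(teacher, student, unlabeled_loader, optimizer):
    """Pseudo-labeling recipe sketched from the description above.

    `teacher` is a frozen non-streaming model with a `transcribe` method and
    `student` is a streaming model returning a loss given (audio, targets);
    both interfaces are placeholders, not a specific library API.
    """
    teacher.eval()
    student.train()
    for audio in unlabeled_loader:
        with torch.no_grad():
            # Full-context decoding: the teacher sees the whole utterance.
            pseudo_targets = teacher.transcribe(audio)
        # The streaming student treats the teacher's hypotheses as if they
        # were ground-truth transcripts.
        loss = student(audio, pseudo_targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```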
Overall, this work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture.
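For readers unfamiliar with the transducer, the sketch below shows a standard RNN-T skeleton (encoder, prediction network, and joint network) trained with `torchaudio.functional.rnnt_loss` from recent torchaudio releases. Layer types, sizes, and the one-hot prediction-network input are placeholder simplifications, not the configuration of the system described here.

```python
import torch
import torch.nn as nn
import torchaudio

class TransducerSketch(nn.Module):
    """Minimal RNN-T skeleton: encoder, prediction network, joint network."""

    def __init__(self, feat_dim=512, vocab=128, hidden=320):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        # For brevity the prediction network consumes one-hot label histories
        # instead of a learned embedding table.
        self.predictor = nn.LSTM(vocab, hidden, batch_first=True)
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                   nn.Linear(hidden, vocab))

    def forward(self, feats, label_history):
        # feats: fused audio-visual features, shape (B, T, feat_dim).
        # label_history: one-hot previous labels with a leading blank step,
        #                shape (B, U + 1, vocab).
        enc, _ = self.encoder(feats)                        # (B, T, H)
        pred, _ = self.predictor(label_history)             # (B, U+1, H)
        t = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)
        u = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
        return self.joint(torch.cat([t, u], dim=-1))        # (B, T, U+1, vocab)


# Toy usage with the transducer loss (label id 0 is reserved for blank).
model = TransducerSketch()
feats = torch.randn(2, 50, 512)                              # 50 fused A/V frames
targets = torch.randint(1, 128, (2, 10), dtype=torch.int32)
history = nn.functional.one_hot(
    torch.cat([torch.zeros(2, 1, dtype=torch.long), targets.long()], dim=1), 128
).float()
logits = model(feats, history)                               # (2, 50, 11, 128)
logit_lengths = torch.full((2,), 50, dtype=torch.int32)
target_lengths = torch.full((2,), 10, dtype=torch.int32)
loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths, blank=0
)
print(loss)
```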