no code implementations • 11 May 2022 • Otavio Braga, Olivier Siohan
As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model.
no code implementations • 11 May 2022 • Otavio Braga, Takaki Makino, Olivier Siohan, Hank Liao
Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face in the visual signal is the face matching the audio.
no code implementations • 10 May 2022 • Otavio Braga, Olivier Siohan
Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face.
no code implementations • 25 Jan 2022 • Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
In this work, we propose to replace the 3D convolution with a video transformer as the video feature extractor.
Audio-Visual Speech Recognition
Automatic Speech Recognition
no code implementations • 20 Sep 2021 • Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
In this work, we propose to replace the 3D convolutional visual front-end with a video transformer front-end.
Audio-Visual Speech Recognition
Automatic Speech Recognition
1 code implementation • 8 Nov 2019 • Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture.
Ranked #3 on Audio-Visual Speech Recognition on LRS3-TED