A recently proposed multimodal fusion strategy, AV Align, based on state-of-the-art sequence-to-sequence neural networks, attempts to model this relationship by explicitly aligning the acoustic and visual representations of speech.
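To make the alignment idea concrete, here is a minimal sketch of a cross-modal attention layer in which each audio frame attends over the video frames and the attended visual context is fused back into the audio representation. The class name, dimensions, and fusion layer are illustrative assumptions, not AV Align's exact configuration.

```python
import torch
import torch.nn as nn

class CrossModalAlign(nn.Module):
    """Sketch of AV Align-style explicit alignment: every audio frame
    attends over the video frames, and the attended visual context is
    fused back into the audio state. Dimensions and the fusion layer
    are illustrative assumptions."""

    def __init__(self, d_audio=256, d_video=256, d_att=256):
        super().__init__()
        self.q = nn.Linear(d_audio, d_att)  # queries from audio states
        self.k = nn.Linear(d_video, d_att)  # keys from video states
        self.v = nn.Linear(d_video, d_att)  # values from video states
        self.fuse = nn.Linear(d_audio + d_att, d_audio)
        self.scale = d_att ** 0.5

    def forward(self, audio, video):
        # audio: (B, Ta, d_audio), video: (B, Tv, d_video)
        scores = self.q(audio) @ self.k(video).transpose(1, 2) / self.scale
        align = torch.softmax(scores, dim=-1)   # (B, Ta, Tv) alignment matrix
        context = align @ self.v(video)         # (B, Ta, d_att) visual context
        fused = self.fuse(torch.cat([audio, context], dim=-1))
        return fused, align
```

Inspecting `align` exposes the learned audio-to-video correspondence that this style of fusion makes explicit, e.g. `fused, align = CrossModalAlign()(torch.randn(2, 50, 256), torch.randn(2, 25, 256))`.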
Vision is often used as a complementary modality for automatic speech recognition (ASR), especially in noisy environments, where the performance of the audio modality alone deteriorates significantly.
In recent years, Automatic Speech Recognition (ASR) technology has approached human-level performance on conversational speech under relatively clean listening conditions.
The audio-visual speech fusion strategy AV Align has shown significant performance improvements in audio-visual speech recognition (AVSR) on the challenging LRS2 dataset.
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture.
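As a rough illustration of what an RNN-T with two input modalities looks like, the sketch below fuses audio and video features at the encoder input and combines encoder and prediction-network states in a joint network. It assumes frame-synchronous features, and the layer sizes, fusion choice, and names are illustrative assumptions rather than the system described in the paper.

```python
import torch
import torch.nn as nn

class AVTransducer(nn.Module):
    """Sketch of an RNN-T for audio-visual speech recognition, assuming
    early fusion of frame-synchronous audio and video features. All
    sizes and the fusion strategy are illustrative assumptions."""

    def __init__(self, d_audio=80, d_video=512, d_model=256, n_labels=64):
        super().__init__()
        # encoder (transcription network) over fused audio-visual frames
        self.encoder = nn.LSTM(d_audio + d_video, d_model,
                               num_layers=2, batch_first=True)
        # prediction network over previously emitted labels
        self.embed = nn.Embedding(n_labels + 1, d_model)  # +1 for blank
        self.predictor = nn.LSTM(d_model, d_model, batch_first=True)
        # joint network producing per-(frame, label-step) logits
        self.joint = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Tanh(),
            nn.Linear(d_model, n_labels + 1),
        )

    def forward(self, audio, video, labels):
        # audio: (B, T, d_audio), video: (B, T, d_video), labels: (B, U)
        enc, _ = self.encoder(torch.cat([audio, video], dim=-1))  # (B, T, D)
        pred, _ = self.predictor(self.embed(labels))              # (B, U, D)
        # combine every encoder frame with every predictor step
        t = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)     # (B, T, U, D)
        u = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)     # (B, T, U, D)
        return self.joint(torch.cat([t, u], dim=-1))              # (B, T, U, V+1)
```

The resulting (B, T, U, V+1) output lattice would then be trained with a transducer loss (e.g. `torchaudio.functional.rnnt_loss`), which marginalizes over all monotonic alignments between frames and labels.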