Visual Speech Recognition
35 papers with code • 2 benchmarks • 5 datasets
Most implemented papers
Combining Residual Networks with LSTMs for Lipreading
We propose an end-to-end deep learning architecture for word-level visual speech recognition.
Deep Audio-Visual Speech Recognition
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
End-to-end Audio-visual Speech Recognition with Conformers
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer) that can be trained in an end-to-end manner.
LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild
The benchmark shows large variation in several aspects, including the number of samples in each class, video resolution, lighting conditions, and speakers' attributes such as pose, age, gender, and make-up.
Visual Speech Recognition for Multiple Languages in the Wild
Recent advances in visual speech recognition are usually due to larger training sets rather than model design.
Deep word embeddings for visual speech recognition
In this paper we present a deep learning architecture for extracting word embeddings for visual speech recognition.
Zero-shot keyword spotting for visual speech recognition in-the-wild
Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information.
Harnessing GANs for Zero-shot Learning of New Classes in Visual Speech Recognition
We present a novel approach to zero-shot learning that generates new classes using Generative Adversarial Networks (GANs), and show that adding unseen-class samples increases the accuracy of a VSR system by a significant margin of 27% and allows it to handle speaker-independent out-of-vocabulary phrases.
Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture.
Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition
Recent advances in deep learning have heightened interest among researchers in the field of visual speech recognition (VSR).