34 papers with code • 3 benchmarks • 6 datasets
Lip Reading is a task to infer the speech content in a video by using only the visual information, especially the lip movements. It has many crucial applications in practice, such as assisting audio-based speech recognition, biometric authentication and aiding hearing-impaired people.
Source: Mutual Information Maximization for Effective Lip Reading
These leaderboards are used to track progress in Lip Reading
Most implemented papers
Combining Residual Networks with LSTMs for Lipreading
We propose an end-to-end deep learning architecture for word-level visual speech recognition.
Deep Audio-Visual Speech Recognition
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild
It has shown a large variation in this benchmark in several aspects, including the number of samples in each class, video resolution, lighting conditions, and speakers' attributes such as pose, age, gender, and make-up.
Lipreading using Temporal Convolutional Networks
We present results on the largest publicly-available datasets for isolated word recognition in English and Mandarin, LRW and LRW1000, respectively.
AuthNet: A Deep Learning based Authentication Mechanism using Temporal Facial Feature Movements
Biometric systems based on Machine learning and Deep learning are being extensively used as authentication mechanisms in resource-constrained environments like smartphones and other small computing devices.
End-to-end Audio-visual Speech Recognition with Conformers
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner.
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
The lip-reading WER is further reduced to 26. 9% when using all 433 hours of labeled data from LRS3 and combined with self-training.
Estimating speech from lip dynamics
The goal of this project is to develop a limited lip reading algorithm for a subset of the English language.
XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification
Our work improves on existing multimodal deep learning algorithms in two essential ways: (1) it presents a novel method for performing cross-modality (before features are learned from individual modalities) and (2) extends the previously proposed cross-connections which only transfer information between streams that process compatible data.
Lip2AudSpec: Speech reconstruction from silent lip movements video
In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos.