Lip Reading
47 papers with code • 3 benchmarks • 6 datasets
Lip reading is the task of inferring the speech content of a video from visual information alone, especially the lip movements. It has many important practical applications, such as assisting audio-based speech recognition, biometric authentication, and aiding hearing-impaired people.
Source: Mutual Information Maximization for Effective Lip Reading
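To make the task concrete, here is a minimal sketch of what word-level lip-reading inference can look like, assuming mouth-region crops and a trained classifier are already available; `model`, `predict_word`, and the clip dimensions are illustrative placeholders, not any particular paper's setup.

```python
import torch

def predict_word(model: torch.nn.Module, frames: torch.Tensor) -> int:
    """frames: (T, H, W) grayscale mouth crops, e.g. 29 frames of 88x88 pixels."""
    clip = frames.unsqueeze(0).unsqueeze(0)            # (1, 1, T, H, W) for 3D-conv models
    clip = (clip - clip.mean()) / (clip.std() + 1e-6)  # per-clip intensity normalization
    with torch.no_grad():
        logits = model(clip)                           # (1, num_words) class scores
    return logits.argmax(dim=-1).item()                # index into the word vocabulary
```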
Most implemented papers
Combining Residual Networks with LSTMs for Lipreading
We propose an end-to-end deep learning architecture for word-level visual speech recognition.
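A hedged sketch of this style of architecture in PyTorch: a 3D convolutional front-end for short-range lip motion, a per-frame 2D ResNet trunk (torchvision's resnet18 stands in here; the paper's trunk and all sizes are illustrative), and a bidirectional LSTM back-end over time.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ResNetLSTMLipReader(nn.Module):
    def __init__(self, num_words: int = 500):
        super().__init__()
        # Spatiotemporal front-end: a 3D conv captures short-range lip motion.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, (5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Per-frame 2D trunk; the stem is adapted to the 64-channel front-end output.
        trunk = resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, 3, stride=1, padding=1, bias=False)
        trunk.fc = nn.Identity()                       # expose 512-d frame features
        self.trunk = trunk
        self.lstm = nn.LSTM(512, 256, num_layers=2, batch_first=True, bidirectional=True)
        self.head = nn.Linear(512, num_words)

    def forward(self, x):                              # x: (B, 1, T, H, W) grayscale clip
        f = self.frontend(x)                           # (B, 64, T, H', W')
        B, C, T, H, W = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
        f = self.trunk(f).reshape(B, T, 512)           # per-frame embeddings
        out, _ = self.lstm(f)                          # (B, T, 512) temporal context
        return self.head(out.mean(dim=1))              # average over time -> word logits
```

Pooling the LSTM outputs over time yields one word-level prediction per clip, which matches the isolated-word setting.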
Deep Audio-Visual Speech Recognition
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
End-to-end Audio-visual Speech Recognition with Conformers
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer) that can be trained in an end-to-end manner.
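A sketch of the hybrid objective such models optimize, with torchaudio's Conformer as a stand-in encoder; the attention-decoder cross-entropy term `attn_loss` is a placeholder computed elsewhere, and the vocabulary size and weight `lam` are illustrative.

```python
import torch
import torch.nn as nn
from torchaudio.models import Conformer

encoder = Conformer(input_dim=512, num_heads=8, ffn_dim=2048,
                    num_layers=12, depthwise_conv_kernel_size=31)
ctc_head = nn.Linear(512, 40)                # 40 = hypothetical token vocabulary size
ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)

def hybrid_loss(feats, feat_lens, targets, target_lens, attn_loss, lam=0.1):
    """feats: (B, T, 512) visual features; attn_loss: decoder cross-entropy term."""
    enc, enc_lens = encoder(feats, feat_lens)         # Conformer outputs (B, T, 512)
    log_probs = ctc_head(enc).log_softmax(dim=-1)     # (B, T, V) token log-probabilities
    l_ctc = ctc_criterion(log_probs.transpose(0, 1),  # CTCLoss expects (T, B, V)
                          targets, enc_lens, target_lens)
    return lam * l_ctc + (1.0 - lam) * attn_loss      # weighted hybrid CTC/attention loss
```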
LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild
This benchmark shows large variation in several aspects, including the number of samples per class, video resolution, lighting conditions, and speaker attributes such as pose, age, gender, and make-up.
Lipreading using Temporal Convolutional Networks
We present results on the largest publicly available datasets for isolated word recognition in English and Mandarin, LRW and LRW-1000, respectively.
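A hedged sketch of the temporal-convolution idea behind such back-ends: stacked dilated 1D convolutions with residual connections over per-frame visual features replace a recurrent network. Channel sizes and dilations here are illustrative, not the paper's multi-scale configuration.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation        # keep sequence length fixed
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels), nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                              # x: (B, C, T) feature sequence
        return self.relu(x + self.net(x))              # residual connection

# Growing dilations widen the temporal receptive field exponentially.
backend = nn.Sequential(*[TCNBlock(512, d) for d in (1, 2, 4, 8)])
```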
AuthNet: A Deep Learning based Authentication Mechanism using Temporal Facial Feature Movements
Biometric systems based on machine learning and deep learning are being extensively used as authentication mechanisms in resource-constrained environments like smartphones and other small computing devices.
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 combined with self-training.
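A minimal sketch of masked cluster prediction as the title describes it: random frames are masked and the model predicts discrete cluster IDs at the masked positions. Here `encoder`, the masking scheme, and the cluster targets are simplified placeholders; in practice span masking and a learned mask embedding are typical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_cluster_loss(encoder: nn.Module, feats: torch.Tensor,
                        cluster_ids: torch.Tensor, mask_prob: float = 0.3):
    """feats: (B, T, D) fused audio-visual features; cluster_ids: (B, T) targets."""
    mask = torch.rand(feats.shape[:2], device=feats.device) < mask_prob  # (B, T) mask
    masked = feats.clone()
    masked[mask] = 0.0                       # zero out masked frames (a learned mask
                                             # embedding is typical in practice)
    logits = encoder(masked)                 # (B, T, num_clusters) predictions
    return F.cross_entropy(logits[mask], cluster_ids[mask])  # masked frames only
```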
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
Although researchers have explored cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, cross-lingual studies on visual speech remain scarce.
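A hedged sketch of what mixing audio-visual streams can look like, using standard mixup-style interpolation with a Beta-sampled coefficient; this is a generic illustration, and MixSpeech's exact mixing scheme may differ.

```python
import torch

def mixup_streams(audio_a, video_a, audio_b, video_b, alpha: float = 0.4):
    """Interpolate two paired audio-visual examples with one shared coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    audio = lam * audio_a + (1.0 - lam) * audio_b   # mix the acoustic streams
    video = lam * video_a + (1.0 - lam) * video_b   # mix the visual streams
    return audio, video, lam                        # lam also weights the two losses
```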
Estimating speech from lip dynamics
The goal of this project is to develop a limited lip reading algorithm for a subset of the English language.
XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification
Our work improves on existing multimodal deep learning algorithms in two essential ways: (1) it presents a novel method for cross-modal information exchange before features are learned from individual modalities, and (2) it extends the previously proposed cross-connections, which only transfer information between streams that process compatible data.
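A hedged sketch of cross-connections between two modality streams: at each stage, a projection of one stream's features is added into the other, so information flows across modalities while features are still being learned. Layer types and sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class CrossConnectedStage(nn.Module):
    def __init__(self, dim_a: int, dim_v: int):
        super().__init__()
        self.block_a = nn.Sequential(nn.Linear(dim_a, dim_a), nn.ReLU())
        self.block_v = nn.Sequential(nn.Linear(dim_v, dim_v), nn.ReLU())
        self.a_to_v = nn.Linear(dim_a, dim_v)           # audio -> visual transfer
        self.v_to_a = nn.Linear(dim_v, dim_a)           # visual -> audio transfer

    def forward(self, a, v):
        a_next = self.block_a(a) + self.v_to_a(v)       # each stream receives a
        v_next = self.block_v(v) + self.a_to_v(a)       # projection of the other
        return a_next, v_next

stage = CrossConnectedStage(dim_a=128, dim_v=256)
a, v = stage(torch.randn(8, 128), torch.randn(8, 256))  # batch of 8 paired features
```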