Lip Reading
46 papers with code • 3 benchmarks • 5 datasets
Lip reading is the task of inferring the speech content of a video using only visual information, especially the lip movements. It has many crucial practical applications, such as assisting audio-based speech recognition, biometric authentication, and aiding hearing-impaired people.
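To make the task concrete, here is a minimal, deliberately toy sketch of a word-level lip-reading pipeline: a hand-crafted per-frame "mouth openness" feature stands in for a learned visual front-end (real systems use e.g. 3D CNNs), and dynamic time warping against word templates stands in for a trained sequence classifier. All feature values, templates, and words are hypothetical illustration data, not from any dataset above.

```python
import numpy as np

def mouth_openness(frame):
    """Toy per-frame feature: mean intensity of a (hypothetical) mouth crop.
    A real system would use a learned front-end such as a 3D CNN."""
    return float(frame.mean())

def dtw_distance(a, b):
    """Dynamic time warping between two 1-D feature sequences, so the same
    word spoken at different speeds can still be compared."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def classify(frames, templates):
    """Label a clip with the word whose template trajectory is nearest in DTW."""
    seq = [mouth_openness(f) for f in frames]
    return min(templates, key=lambda word: dtw_distance(seq, templates[word]))

# Hypothetical openness trajectories for two words.
templates = {"yes": [0.1, 0.6, 0.9, 0.6, 0.1], "no": [0.1, 0.3, 0.3, 0.1]}
# A synthetic 5-frame clip of 4x4 "mouth crops" that roughly follows "yes".
clip = [np.full((4, 4), v) for v in (0.1, 0.5, 0.85, 0.6, 0.2)]
print(classify(clip, templates))  # → yes
```

In practice the template matcher is replaced by an end-to-end network trained with CTC or attention-based losses, but the structure is the same: per-frame visual features followed by temporal alignment and decoding.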
Source: Mutual Information Maximization for Effective Lip Reading
Latest papers with no code
Leveraging Visemes for Better Visual Speech Representation and Lip Reading
We evaluate our approach on various tasks, including word-level and sentence-level lip reading, and audio-visual speech recognition using the Arman-AV dataset, a large-scale Persian corpus.
Emotional Speech-Driven Animation with Content-Emotion Disentanglement
While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions.
Deep Learning-based Spatio-Temporal Facial Feature Visual Speech Recognition
In low-resource computing contexts, such as smartphones and other small devices, both deep learning and machine learning are used in many identification systems.
PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors
Conventional image sensors digitize high-resolution images at fast frame rates, producing a large amount of data that needs to be transmitted off the sensor for further processing.
Word-level Persian Lipreading Dataset
Lip-reading has made impressive progress in recent years, driven by advances in deep learning.
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision
Furthermore, when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours).
A large-scale multimodal dataset of human speech recognition
The dataset has been validated and has potential for research on lip reading and multimodal speech recognition.
Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise.
A Multi-Purpose Audio-Visual Corpus for Multi-Modal Persian Speech Recognition: the Arman-AV Dataset
In addition, we have proposed a technique to detect visemes (a visual equivalent of a phoneme) in Persian.
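The viseme concept mentioned above is a many-to-one grouping: several phonemes that look alike on the lips collapse into one visual class, which is why lip reading is inherently ambiguous. The grouping below is a hypothetical English-like illustration, not the Persian inventory proposed in the Arman-AV work.

```python
# Hypothetical many-to-one phoneme-to-viseme grouping. Real inventories are
# language-specific (e.g., the Persian viseme set in the Arman-AV paper).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "s": "alveolar", "z": "alveolar",
    "a": "open", "e": "mid", "i": "spread", "o": "round", "u": "round",
}

def to_visemes(phonemes):
    """Collapse a phoneme sequence to its viseme sequence; phonemes that are
    visually indistinguishable map to the same viseme class."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

print(to_visemes(["b", "a", "t"]))  # → ['bilabial', 'open', 'alveolar']
```

Note that "bat" and "mat" produce identical viseme sequences (they are homophenes), so a viseme-level model must rely on context to disambiguate them.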
Speech Driven Video Editing via an Audio-Conditioned Diffusion Model
Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model.