Lipreading

31 papers with code • 7 benchmarks • 6 datasets

Lipreading is the process of extracting speech by watching a speaker's lip movements in the absence of sound. Humans lipread all the time without noticing. It plays a significant part in communication, though a less dominant one than audio, and it is an especially helpful skill for those who are hard of hearing.

Deep Lipreading is the process of extracting speech from a video of a silent talking face using deep neural networks. It is also known by a few other names: Visual Speech Recognition (VSR), Machine Lipreading, Automatic Lipreading, etc.

The primary methodology involves two stages: (i) extracting visual and temporal features from the sequence of image frames of a silent talking video, and (ii) processing the sequence of features into units of speech, e.g. characters, words, or phrases. Implementations of this methodology either train the two stages separately or train them end-to-end in one go.
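The two stages can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not any listed paper's method: the hand-written features (mean brightness and frame-to-frame change) stand in for a learned visual front-end, the linear scorer with fixed weights stands in for a trained sequence model, and the greedy CTC-style collapse is just one possible way to turn per-frame labels into characters. The names `extract_features`, `classify_frames`, `greedy_ctc_decode`, `ALPHABET` and `WEIGHTS` are all hypothetical.

```python
from typing import List, Sequence

Frame = List[List[float]]  # one grayscale mouth-region image: rows of pixel values

BLANK = "_"                   # assumed CTC blank symbol
ALPHABET = [BLANK, "a", "b"]  # toy label set (hypothetical)

# Hypothetical fixed weights, one row per label: [w_mean, w_motion, bias].
# A real system would learn these (and far richer features) from data.
WEIGHTS = [
    [0.0, 0.0, 0.5],  # blank: constant score
    [1.0, 0.0, 0.0],  # "a": responds to brightness
    [0.0, 2.0, 0.0],  # "b": responds to motion
]


def extract_features(frames: Sequence[Frame]) -> List[List[float]]:
    """Stage i (toy stand-in): one [visual, temporal] feature pair per frame."""
    features = []
    prev = None
    for frame in frames:
        pixels = [p for row in frame for p in row]
        mean = sum(pixels) / len(pixels)  # visual cue: mean intensity
        if prev is None:
            motion = 0.0                  # no previous frame to compare with
        else:
            # temporal cue: mean absolute change from the previous frame
            motion = sum(abs(a - b) for a, b in zip(pixels, prev)) / len(pixels)
        features.append([mean, motion])
        prev = pixels
    return features


def classify_frames(features: Sequence[Sequence[float]]) -> List[int]:
    """Stage ii, part 1: score every label per frame and keep the best one."""
    label_ids = []
    for mean, motion in features:
        scores = [w[0] * mean + w[1] * motion + w[2] for w in WEIGHTS]
        label_ids.append(max(range(len(scores)), key=scores.__getitem__))
    return label_ids


def greedy_ctc_decode(label_ids: Sequence[int]) -> str:
    """Stage ii, part 2: collapse repeated labels, then drop blanks."""
    chars = []
    prev = None
    for i in label_ids:
        if i != prev and ALPHABET[i] != BLANK:
            chars.append(ALPHABET[i])
        prev = i
    return "".join(chars)


def lipread(frames: Sequence[Frame]) -> str:
    """Run both stages in sequence on a silent clip."""
    return greedy_ctc_decode(classify_frames(extract_features(frames)))
```

Calling `lipread` on a list of frames runs both stages back to back. In an end-to-end system the two stages would be trained jointly rather than wired together by hand as they are here.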

Benchmarks

These leaderboards are used to track progress in Lipreading.

Dataset	Best Model
Lip Reading in the Wild	3D Conv + ResNet-18 + DC-TCN + KD (Ensemble) (Word Boundary)
LRS2	CTC/Attention
LRS3-TED	CTC/Attention
CAS-VSR-W1k (LRW-1000)	3D-ResNet + Bi-GRU + MixUp + Label Smooth + Cosine LR (Word Boundary)
GRID corpus (mixed-speech)	CTC/Attention
CMLR	CTC/Attention
LRW-1000	3D Conv + ResNet-34 + Bi-GRU

Libraries

Use these libraries to find Lipreading models and implementations

facebookresearch/av_hubert

2 papers

786

Datasets

Latest papers

Audio-Visual Speech Recognition based on Regulated Transformer and Spatio-Temporal Fusion Strategy for Driver Assistive Systems

SMIL-SPCRAS/AVCRFormer • Expert Systems with Applications 2024

The article introduces a novel audio-visual speech command recognition transformer (AVCRFormer) specifically designed for robust AVSR.

09 May 2024

Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

sally-sh/vsp-llm • 23 Feb 2024

In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements.

273

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

mpc001/auto_avsr • 25 Mar 2023

Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets.

130

LipLearner: Customizable Silent Speech Interactions on Mobile Devices

rkmtlab/LipLearner • 12 Feb 2023

Silent speech interface is a promising technology that enables private communications in natural language.

Jointly Learning Visual and Auditory Speech Representations from Raw Data

ahaliassos/raven • 12 Dec 2022

We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders are jointly trained.

Relaxed Attention for Transformer Models

Oguzhanercan/Vision-Transformers • 20 Sep 2022

The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and - for natural language processing tasks - lead to an implicitly learned internal language model in the autoregressive transformer decoder complicating the integration of external language models.

Training Strategies for Improved Lip-reading

mpc001/Lipreading_using_Temporal_Convolutional_Networks • 3 Sep 2022

In this paper, we systematically investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies, like self-distillation and using word boundary indicators.

366

Bayesian Neural Network Language Modeling for Speech Recognition

amourwaltz/bayeslms • 28 Aug 2022

State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.

Visual Speech Recognition for Multiple Languages in the Wild

mpc001/Visual_Speech_Recognition_for_Multiple_Languages • 26 Feb 2022

However, these advances are usually due to the larger training sets rather than the model design.

292

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

lumia-group/leveraging-self-supervised-learning-for-avsr • ACL 2022

In particular, audio and visual front-ends are trained on large-scale unimodal datasets, then we integrate components of both front-ends into a larger multimodal framework which learns to recognize parallel audio-visual data into characters through a combination of CTC and seq2seq decoding.

24 Feb 2022
