Visual Speech Recognition

Combining Residual Networks with LSTMs for Lipreading

We propose an end-to-end deep learning architecture for word-level visual speech recognition.

Deep Audio-Visual Speech Recognition

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.

End-to-end Audio-visual Speech Recognition with Conformers

In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a convolution-augmented Transformer (Conformer) that can be trained in an end-to-end manner.

LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild

The benchmark shows large variation in several aspects, including the number of samples in each class, video resolution, lighting conditions, and speakers' attributes such as pose, age, gender, and make-up.

Visual Speech Recognition for Multiple Languages in the Wild

Recent advances in visual speech recognition are usually due to larger training sets rather than model design.

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

Although researchers have explored cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, cross-lingual studies on visual speech remain scarce.

The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

This paper describes the visual speech recognition (VSR) system introduced by NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, which took part in the fixed and open tracks of the Single-Speaker VSR Task and the open track of the Multi-Speaker VSR Task.

Lip Reading Sentences in the Wild

Deep word embeddings for visual speech recognition

In this paper we present a deep learning architecture for extracting word embeddings for visual speech recognition.

Zero-shot keyword spotting for visual speech recognition in-the-wild

Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information.