The benchmark shows large variation in several aspects, including the number of samples per class, video resolution, lighting conditions, and speaker attributes such as pose, age, gender, and make-up.
A recently proposed multimodal fusion strategy, AV Align, based on state-of-the-art sequence-to-sequence neural networks, attempts to model this relationship by explicitly aligning the acoustic and visual representations of speech.
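The alignment step described above can be illustrated with a minimal sketch. This is not the AV Align implementation itself, just a toy dot-product cross-modal attention in pure Python: each audio frame attends over the video frames, and the attended visual context is fused (here, added) into the audio representation. The function name and the additive fusion are illustrative assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_modal_align(audio_feats, video_feats):
    """Toy AV Align-style fusion: for each audio frame, compute
    dot-product attention weights over the video frames, form the
    attended visual context, and add it to the audio feature."""
    fused = []
    for a in audio_feats:
        # Similarity of this audio frame to every video frame.
        scores = [sum(ai * vi for ai, vi in zip(a, v)) for v in video_feats]
        weights = softmax(scores)
        # Weighted average of video features (the visual context).
        dim = len(video_feats[0])
        context = [sum(w * v[d] for w, v in zip(weights, video_feats))
                   for d in range(dim)]
        # Additive fusion of audio feature and visual context.
        fused.append([ai + ci for ai, ci in zip(a, context)])
    return fused
```

In the real model the attention operates on learned encoder states rather than raw features, and the fused sequence feeds a sequence-to-sequence decoder.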
Considering the non-negligible effects of these strategies and the difficulty of training an effective lip reading model, we perform, for the first time, a comprehensive quantitative study and comparative analysis of the effects of several different choices for lip reading.
Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information.
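The problem statement above can be made concrete with a minimal sketch, assuming the video model emits per-frame character probabilities. This hypothetical scorer (the function, the one-frame-per-character window, and the smoothing floor are all illustrative assumptions, not the paper's method) returns the best average log-probability of spelling the query over any contiguous window:

```python
import math

def kws_score(frame_probs, query):
    """Hypothetical sliding-window keyword score.

    frame_probs[t] is a dict mapping characters to the model's
    probability of that character at video frame t. The score is the
    best average log-probability of spelling the query across any
    contiguous window, assuming one frame per character."""
    n, q = len(frame_probs), len(query)
    best = float("-inf")
    for start in range(n - q + 1):
        # Log-probability that this window spells the query;
        # 1e-9 is a smoothing floor for unseen characters.
        lp = sum(math.log(frame_probs[start + i].get(ch, 1e-9))
                 for i, ch in enumerate(query))
        best = max(best, lp / q)
    return best
```

A real system would compare this score (computed with a sequence model over variable-duration visemes) against a threshold to decide whether the query occurs in the recording.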
In recent years, Automatic Speech Recognition (ASR) technology has approached human-level performance on conversational speech under relatively clean listening conditions.
The audio-visual speech fusion strategy AV Align has shown significant performance improvements in audio-visual speech recognition (AVSR) on the challenging LRS2 dataset.
To solve this problem, we present a novel approach to zero-shot learning that generates new classes using Generative Adversarial Networks (GANs). We show that adding unseen-class samples increases the accuracy of a VSR system by a significant margin of 27% and allows it to handle speaker-independent out-of-vocabulary phrases.
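The augmentation step described above can be sketched as follows. This is a hedged illustration, not the paper's pipeline: `generator` stands in for a trained conditional GAN, and the function name, the class-embedding conditioning, and the Gaussian noise are all assumptions made for the sketch.

```python
import random

def generate_unseen_samples(generator, class_embeddings, unseen_classes,
                            n_per_class):
    """Hypothetical zero-shot augmentation: a trained conditional
    generator maps a class embedding plus noise to a synthetic
    visual-speech feature. Synthetic samples for unseen classes are
    returned so they can be appended to the training set, letting the
    VSR classifier see every label."""
    synthetic = []
    for cls in unseen_classes:
        emb = class_embeddings[cls]
        for _ in range(n_per_class):
            # Fresh noise vector per sample for diversity.
            noise = [random.gauss(0.0, 1.0) for _ in emb]
            synthetic.append((generator(emb, noise), cls))
    return synthetic

# Toy stand-in generator: perturb the class embedding with scaled noise.
toy_gen = lambda emb, noise: [e + 0.1 * z for e, z in zip(emb, noise)]
```

After augmentation, the classifier is trained on the union of real seen-class samples and these synthetic unseen-class samples, which is what lets it recognize out-of-vocabulary phrases.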
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.