The benchmark shows large variation in several aspects, including the number of samples per class, video resolution, lighting conditions, and speaker attributes such as pose, age, gender, and make-up.
A recently proposed multimodal fusion strategy, AV Align, based on state-of-the-art sequence-to-sequence neural networks, attempts to model this relationship by explicitly aligning the acoustic and visual representations of speech.
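The alignment step described above can be illustrated with a minimal sketch. This is not the AV Align implementation itself, just a toy dot-product cross-modal attention in pure Python: each audio frame attends over the video frames, and the attended visual context is fused (here, added) into the audio representation. The function name and the additive fusion are illustrative assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_modal_align(audio_feats, video_feats):
    """Toy AV Align-style fusion: for each audio frame, compute
    dot-product attention weights over the video frames, form the
    attended visual context, and add it to the audio feature."""
    fused = []
    for a in audio_feats:
        # Similarity of this audio frame to every video frame.
        scores = [sum(ai * vi for ai, vi in zip(a, v)) for v in video_feats]
        weights = softmax(scores)
        # Weighted average of video features (the visual context).
        dim = len(video_feats[0])
        context = [sum(w * v[d] for w, v in zip(weights, video_feats))
                   for d in range(dim)]
        # Additive fusion of audio feature and visual context.
        fused.append([ai + ci for ai, ci in zip(a, context)])
    return fused
```

In the real model the attention operates on learned encoder states rather than raw features, and the fused sequence feeds a sequence-to-sequence decoder.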
Considering the non-negligible effects of these strategies and the difficulty of training an effective lip reading model, we perform, for the first time, a comprehensive quantitative study and comparative analysis of the effects of several different choices for lip reading.
Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information.
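The problem statement above can be made concrete with a minimal sketch, assuming the video model emits per-frame character probabilities. This hypothetical scorer (the function, the one-frame-per-character window, and the smoothing floor are all illustrative assumptions, not the paper's method) returns the best average log-probability of spelling the query over any contiguous window:

```python
import math

def kws_score(frame_probs, query):
    """Hypothetical sliding-window keyword score.

    frame_probs[t] is a dict mapping characters to the model's
    probability of that character at video frame t. The score is the
    best average log-probability of spelling the query across any
    contiguous window, assuming one frame per character."""
    n, q = len(frame_probs), len(query)
    best = float("-inf")
    for start in range(n - q + 1):
        # Log-probability that this window spells the query;
        # 1e-9 is a smoothing floor for unseen characters.
        lp = sum(math.log(frame_probs[start + i].get(ch, 1e-9))
                 for i, ch in enumerate(query))
        best = max(best, lp / q)
    return best
```

A real system would compare this score (computed with a sequence model over variable-duration visemes) against a threshold to decide whether the query occurs in the recording.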
In recent years, Automatic Speech Recognition (ASR) technology has approached human-level performance on conversational speech under relatively clean listening conditions.
The audio-visual speech fusion strategy AV Align has shown significant performance improvements in audio-visual speech recognition (AVSR) on the challenging LRS2 dataset.
To solve this problem, we present a novel approach to zero-shot learning that generates new classes using Generative Adversarial Networks (GANs). We show that adding unseen-class samples increases the accuracy of a VSR system by a significant margin of 27% and allows it to handle speaker-independent out-of-vocabulary phrases.
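The augmentation step described above can be sketched as follows. This is a hedged illustration, not the paper's pipeline: `generator` stands in for a trained conditional GAN, and the function name, the class-embedding conditioning, and the Gaussian noise are all assumptions made for the sketch.

```python
import random

def generate_unseen_samples(generator, class_embeddings, unseen_classes,
                            n_per_class):
    """Hypothetical zero-shot augmentation: a trained conditional
    generator maps a class embedding plus noise to a synthetic
    visual-speech feature. Synthetic samples for unseen classes are
    returned so they can be appended to the training set, letting the
    VSR classifier see every label."""
    synthetic = []
    for cls in unseen_classes:
        emb = class_embeddings[cls]
        for _ in range(n_per_class):
            # Fresh noise vector per sample for diversity.
            noise = [random.gauss(0.0, 1.0) for _ in emb]
            synthetic.append((generator(emb, noise), cls))
    return synthetic

# Toy stand-in generator: perturb the class embedding with scaled noise.
toy_gen = lambda emb, noise: [e + 0.1 * z for e, z in zip(emb, noise)]
```

After augmentation, the classifier is trained on the union of real seen-class samples and these synthetic unseen-class samples, which is what lets it recognize out-of-vocabulary phrases.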
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.