Audio-Visual Speech Recognition

Audio-visual speech recognition is the task of transcribing paired audio and visual streams into text.

Most implemented papers

Deep Audio-Visual Speech Recognition

lordmartian/deep_avsr 6 Sep 2018

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.

Discriminative Multi-modality Speech Recognition

JackSyu/Discriminative-Multi-modality-Speech-Recognition CVPR 2020

Vision is often used as a complementary modality for audio speech recognition (ASR), especially in noisy environments, where the performance of the audio modality alone deteriorates significantly.

Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

around-star/Speech-Recognition 8 Nov 2019

This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture.

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

georgesterpu/Sigmedia-AVSR 17 Apr 2020

A recently proposed multimodal fusion strategy, AV Align, based on state-of-the-art sequence-to-sequence neural networks, attempts to model the relationship between the audio and visual modalities by explicitly aligning the acoustic and visual representations of speech.

Should we hard-code the recurrence concept or learn it instead ? Exploring the Transformer architecture for Audio-Visual Speech Recognition

georgesterpu/Taris 19 May 2020

The audio-visual speech fusion strategy AV Align has shown significant performance improvements in audio-visual speech recognition (AVSR) on the challenging LRS2 dataset.

AV Taris: Online Audio-Visual Speech Recognition

georgesterpu/Taris 14 Dec 2020

In recent years, Automatic Speech Recognition (ASR) technology has approached human-level performance on conversational speech under relatively clean listening conditions.

End-to-end Audio-visual Speech Recognition with Conformers

mpc001/Visual_Speech_Recognition_for_Multiple_Languages 12 Feb 2021

In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a convolution-augmented Transformer (Conformer) that can be trained in an end-to-end manner.
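
As a rough illustration of what a hybrid CTC/Attention objective usually looks like, the sketch below interpolates a CTC loss and an attention-decoder loss with a single weight. The helper name `hybrid_loss`, the parameter `ctc_weight`, and the example values are assumptions made for illustration; none of them are taken from the paper or its repository.

```python
def hybrid_loss(ctc_loss: float, attention_loss: float, ctc_weight: float = 0.1) -> float:
    """Interpolate two per-utterance losses (hypothetical helper).

    ctc_weight is an assumed hyperparameter in [0, 1]; the value used
    by the paper is not stated in this listing.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    # Weighted sum: ctc_weight on the CTC branch, the remainder on the attention branch.
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Example with made-up loss values for one utterance.
loss = hybrid_loss(ctc_loss=2.0, attention_loss=1.0, ctc_weight=0.5)
```

With `ctc_weight` set to 0 the objective reduces to the attention loss alone, and with 1 it reduces to pure CTC, so the weight controls how strongly the alignment constraint from CTC shapes training.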

Robust Self-Supervised Audio-Visual Speech Recognition

facebookresearch/av_hubert 5 Jan 2022

Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe.

CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition

hltchkust/ci-avsr 11 Jan 2022

With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities.

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

lumia-group/leveraging-self-supervised-learning-for-avsr ACL 2022

In particular, audio and visual front-ends are trained on large-scale unimodal datasets; components of both front-ends are then integrated into a larger multimodal framework, which learns to transcribe parallel audio-visual data into characters through a combination of CTC and seq2seq decoding.