Audio-Visual Speech Recognition
28 papers with code • 3 benchmarks • 6 datasets
Audio-visual speech recognition is the task of transcribing a paired audio and visual stream into text.
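A minimal sketch of what "paired audio and visual stream" means in practice: time-aligned audio features (e.g. log-Mel filterbanks) and visual features (e.g. lip-region embeddings) are fused and decoded to text. All dimensions and the late-fusion-by-concatenation design below are illustrative assumptions, not taken from any specific paper on this page.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from any listed paper).
T = 50          # time steps after aligning audio and video frame rates
D_AUDIO = 80    # e.g. log-Mel filterbank features per step
D_VIDEO = 256   # e.g. lip-region embedding per step
VOCAB = 30      # output characters to decode

audio_feats = rng.standard_normal((T, D_AUDIO))
video_feats = rng.standard_normal((T, D_VIDEO))

# Late fusion: concatenate the time-aligned streams, then project
# to per-step character logits (a random stand-in for a trained decoder).
fused = np.concatenate([audio_feats, video_feats], axis=1)  # (T, D_AUDIO + D_VIDEO)
proj = rng.standard_normal((D_AUDIO + D_VIDEO, VOCAB)) * 0.01
logits = fused @ proj                                       # (T, VOCAB)

print(logits.shape)
```

A real AVSR system would replace the random projection with a trained encoder-decoder and collapse the per-step predictions (e.g. with CTC or attention-based decoding) into a transcript.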
Latest papers
Audio-Visual Speech Recognition based on Regulated Transformer and Spatio-Temporal Fusion Strategy for Driver Assistive Systems
The article introduces a novel audio-visual speech command recognition transformer (AVCRFormer) specifically designed for robust AVSR.
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
In this paper, we investigate this contrasting phenomenon from the perspective of modality bias and reveal that an excessive modality bias on the audio caused by dropout is the underlying reason.
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose a multichannel multi-modal self-supervised speech representation learning framework, AV-wav2vec2, which uses video and multichannel audio data as inputs.
RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
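To make the time-frequency framing concrete, here is a toy sketch of T-F masking separation: transform the mixture to the frequency domain frame by frame, apply a per-bin ratio mask, and invert. The rectangular non-overlapping windows, exact-bin sinusoid sources, and oracle mask are simplifying assumptions for illustration; RTFS-Net itself uses a learned recurrent model, not this oracle setup.

```python
import numpy as np

FRAME = 256     # analysis frame length (illustrative)
N_FRAMES = 4
n = np.arange(FRAME * N_FRAMES)

# Two sources placed on exact FFT bins, so an ideal ratio mask
# separates them almost perfectly in this toy setting.
s1 = np.sin(2 * np.pi * 10 * n / FRAME)
s2 = np.sin(2 * np.pi * 40 * n / FRAME)
mix = s1 + s2

def stft(x):
    # Rectangular window, hop == frame length: trivially invertible.
    return np.fft.rfft(x.reshape(-1, FRAME), axis=1)

def istft(X):
    return np.fft.irfft(X, n=FRAME, axis=1).reshape(-1)

X, S1, S2 = stft(mix), stft(s1), stft(s2)

# Oracle ideal ratio mask; a trained network predicts this in practice.
eps = 1e-8
m1 = np.abs(S1) / (np.abs(S1) + np.abs(S2) + eps)

est1 = istft(m1 * X)
mse = float(np.mean((est1 - s1) ** 2))
print(mse)  # near zero: the mask recovers s1 from the mixture
```

Time-domain methods instead operate directly on the waveform with learned filterbanks; the claim above is that this T-F formulation now matches or beats them.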
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder
In this paper, we propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
In this paper, we aim to learn the shared representations across modalities to bridge their gap.
Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition
In this work, we investigate the noise-invariant visual modality to strengthen the robustness of AVSR, which can adapt to any test-time noise without depending on noisy training data, a.k.a. unsupervised noise adaptation.
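The viseme-phoneme mapping referenced here rests on the fact that several phonemes look identical on the lips, so the relation is many-to-one. The grouping below is a generic textbook-style illustration of that ambiguity, not the universal mapping proposed in the paper.

```python
# Toy many-to-one phoneme-to-viseme table (illustrative grouping only;
# viseme inventories differ between papers and languages).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "s": "alveolar", "z": "alveolar",
    "k": "velar", "g": "velar",
}

def viseme_candidates(viseme):
    """Phonemes that look alike on the lips for a given viseme class."""
    return sorted(p for p, v in PHONEME_TO_VISEME.items() if v == viseme)

print(viseme_candidates("bilabial"))  # ['b', 'm', 'p']
```

Resolving which phoneme within a viseme class was actually spoken is exactly where the (possibly noisy) audio stream helps the visual one.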
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
We demonstrate that OpenSR enables modality transfer from one to any in three different settings (zero-, few- and full-shot), and achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods.
MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information
Audio-visual speech recognition (AVSR) has been gaining increasing attention from researchers as an important part of human-computer interaction.
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.