Audio-Visual Speech Recognition

27 papers with code • 3 benchmarks • 6 datasets

Audio-visual speech recognition is the task of transcribing paired audio and visual streams into text.

Latest papers with no code

Learn2Talk: 3D Talking Face Learns from 2D Talking Face

no code yet • 19 Apr 2024

Speech-driven facial animation methods generally fall into two main classes, 3D and 2D talking faces, both of which have attracted considerable research attention in recent years.

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

no code yet • 21 Mar 2024

It is designed to maximize the benefits of limited multilingual AV pre-training data by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes.

It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

no code yet • 8 Feb 2024

Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of automatic speech recognition (ASR) output.

SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition

no code yet • 18 Jan 2024

Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio.

MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

no code yet • 7 Jan 2024

While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness.
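As an illustration of the audio-visual fusion idea (a minimal sketch, not the MLCA-AVSR method itself; the feature shapes, dimensions, and residual-style combination are all assumptions), audio frame features can attend over visual frame features with scaled dot-product cross-attention:

```python
import numpy as np

def cross_modal_attention(audio_feats, visual_feats):
    """Audio frames (queries) attend over visual frames (keys/values)."""
    d = audio_feats.shape[-1]
    scores = audio_feats @ visual_feats.T / np.sqrt(d)   # (Ta, Tv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over visual frames
    attended = weights @ visual_feats                    # (Ta, d)
    # residual-style fusion: keep the audio stream, add visual context
    return audio_feats + attended

rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 8))    # 4 audio frames, 8-dim features
visual = rng.standard_normal((6, 8))   # 6 video frames, 8-dim features
fused = cross_modal_attention(audio, visual)
print(fused.shape)  # (4, 8)
```

Because the visual stream is unaffected by acoustic noise, the attended visual context can stabilize the fused representation when the audio features are corrupted.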

AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition

no code yet • 29 Sep 2023

Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR).
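The pseudo-labeling idea behind such approaches can be sketched in a few lines (a toy illustration, not AV-CPL itself; `toy_predict`, the confidence scores, and the threshold are all assumptions): a trained model transcribes unlabeled data, and only sufficiently confident transcripts are kept as training targets.

```python
def pseudo_label(model_predict, unlabeled, threshold=0.9):
    """Keep model predictions confident enough to serve as training targets."""
    labeled = []
    for x in unlabeled:
        text, confidence = model_predict(x)
        if confidence >= threshold:
            labeled.append((x, text))
    return labeled

# toy stand-in for an AVSR model's decoding step
def toy_predict(x):
    return ("hello world", 0.95) if x % 2 == 0 else ("noise", 0.5)

batch = pseudo_label(toy_predict, [0, 1, 2, 3])
print(batch)  # [(0, 'hello world'), (2, 'hello world')]
```

In a continuous variant, the model is retrained on the accepted pairs and the labeling pass is repeated as the model improves.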

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

no code yet • 15 Sep 2023

This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging, real-world acoustic environments.

The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge

no code yet • 11 Mar 2023

This paper describes our NPU-ASLP system for the Audio-Visual Diarization and Recognition (AVDR) task in the Multi-modal Information based Speech Processing (MISP) 2022 Challenge.

Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices

no code yet • Sensors 2023

Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise.

AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

no code yet • 10 Feb 2023

Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems.