Audio-Visual Speech Recognition

27 papers with code • 3 benchmarks • 6 datasets

Audio-visual speech recognition is the task of transcribing paired audio and visual streams into text.

Latest papers with no code

Learn2Talk: 3D Talking Face Learns from 2D Talking Face

no code yet • 19 Apr 2024

Speech-driven facial animation methods generally fall into two main classes, 3D and 2D talking faces, both of which have attracted considerable research attention in recent years.

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

no code yet • 21 Mar 2024

It is designed to maximize the benefits of limited multilingual AV pre-training data by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes.

It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

no code yet • 8 Feb 2024

Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of automatic speech recognition (ASR) output.

SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition

no code yet • 18 Jan 2024

Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio.

MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

no code yet • 7 Jan 2024

While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness.
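As an illustration of the audio-visual fusion idea (a minimal sketch, not the MLCA-AVSR method itself; the feature shapes, dimensions, and residual-style combination are all assumptions), audio frame features can attend over visual frame features with scaled dot-product cross-attention:

```python
import numpy as np

def cross_modal_attention(audio_feats, visual_feats):
    """Audio frames (queries) attend over visual frames (keys/values)."""
    d = audio_feats.shape[-1]
    scores = audio_feats @ visual_feats.T / np.sqrt(d)   # (Ta, Tv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over visual frames
    attended = weights @ visual_feats                    # (Ta, d)
    # residual-style fusion: keep the audio stream, add visual context
    return audio_feats + attended

rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 8))    # 4 audio frames, 8-dim features
visual = rng.standard_normal((6, 8))   # 6 video frames, 8-dim features
fused = cross_modal_attention(audio, visual)
print(fused.shape)  # (4, 8)
```

Because the visual stream is unaffected by acoustic noise, the attended visual context can stabilize the fused representation when the audio features are corrupted.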

AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition

no code yet • 29 Sep 2023

Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR).
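The pseudo-labeling idea behind such approaches can be sketched in a few lines (a toy illustration, not AV-CPL itself; `toy_predict`, the confidence scores, and the threshold are all assumptions): a trained model transcribes unlabeled data, and only sufficiently confident transcripts are kept as training targets.

```python
def pseudo_label(model_predict, unlabeled, threshold=0.9):
    """Keep model predictions confident enough to serve as training targets."""
    labeled = []
    for x in unlabeled:
        text, confidence = model_predict(x)
        if confidence >= threshold:
            labeled.append((x, text))
    return labeled

# toy stand-in for an AVSR model's decoding step
def toy_predict(x):
    return ("hello world", 0.95) if x % 2 == 0 else ("noise", 0.5)

batch = pseudo_label(toy_predict, [0, 1, 2, 3])
print(batch)  # [(0, 'hello world'), (2, 'hello world')]
```

In a continuous variant, the model is retrained on the accepted pairs and the labeling pass is repeated as the model improves.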

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

no code yet • 15 Sep 2023

This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging, real-world acoustic environments.

The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge

no code yet • 11 Mar 2023

This paper describes our NPU-ASLP system for the Audio-Visual Diarization and Recognition (AVDR) task in the Multi-modal Information based Speech Processing (MISP) 2022 Challenge.

Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices

no code yet • Sensors 2023

Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise.

AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

no code yet • 10 Feb 2023

Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems.