Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Lipreading	Lip Reading in the Wild	MoCo + Wav2Vec by SJTU LUMIA	Top-1 Accuracy	85.0%	#11
Automatic Speech Recognition (ASR)	LRS2	MoCo + wav2vec (w/o extLM)	Test WER	2.7%	#2
Lipreading	LRS2	MoCo + wav2vec (w/o extLM)	Word Error Rate (WER)	43.2%	#8
Audio-Visual Speech Recognition	LRS2	MoCo + wav2vec (w/o extLM)	Test WER	2.6%	#2
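
For readers unfamiliar with the WER metric reported above: it is conventionally the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch under that assumption. The function names `word_error_rate` and `relative_reduction` and the whitespace tokenization are illustrative choices, not taken from the paper or its code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative error reduction of `new` with respect to `baseline`."""
    return (baseline - new) / baseline
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion over six reference words. `relative_reduction` shows how a relative improvement between two error rates is usually derived; any numbers passed to it here would be hypothetical, not results from the paper.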

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

ACL 2022 · Xichen Pan, Peiyu Chen, Yichen Gong, Helong Zhou, Xinbing Wang, Zhouhan Lin

Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled multimodal data is costly, especially for audio-visual speech recognition (AVSR). It therefore makes sense to exploit unlabelled unimodal data. On the other hand, although the effectiveness of large-scale self-supervised learning is well established in both the audio and visual modalities, how to integrate those pre-trained models into a multimodal scenario remains underexplored. In this work, we leverage unimodal self-supervised learning to improve multimodal AVSR. In particular, audio and visual front-ends are trained on large-scale unimodal datasets; we then integrate components of both front-ends into a larger multimodal framework that learns to transcribe parallel audio-visual data into characters through a combination of CTC and seq2seq decoding. We show that the components inherited from unimodal self-supervised learning cooperate well, so that the multimodal framework yields competitive results after fine-tuning. Our model is experimentally validated on both word-level and sentence-level tasks. Notably, even without an external language model, our proposed model improves on the state-of-the-art performance on the widely used Lip Reading Sentences 2 (LRS2) dataset by a large margin, with a relative improvement of 30%.
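
The abstract mentions "a combination of CTC and seq2seq decoding" without giving details. One common way such a combination is realized in hybrid CTC/attention systems is to interpolate the two log-probabilities of each candidate transcript and keep the best-scoring one. The sketch below illustrates that general idea only. The names `joint_score` and `pick_best`, the default weight of 0.3, and the toy probabilities are all assumptions for illustration, not values or code from the paper.

```python
import math
from typing import List, Tuple

# (transcript, CTC log-probability, seq2seq/attention log-probability)
Hypothesis = Tuple[str, float, float]


def joint_score(ctc_log_prob: float, att_log_prob: float,
                ctc_weight: float = 0.3) -> float:
    """Linear interpolation of the two log-probabilities.

    ctc_weight is a hypothetical hyperparameter in [0, 1].
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_log_prob + (1.0 - ctc_weight) * att_log_prob


def pick_best(hypotheses: List[Hypothesis], ctc_weight: float = 0.3) -> str:
    """Return the transcript with the highest interpolated score."""
    best = max(hypotheses, key=lambda h: joint_score(h[1], h[2], ctc_weight))
    return best[0]


# Toy candidates with made-up probabilities, for illustration only.
candidates = [
    ("hello world", math.log(0.5), math.log(0.2)),
    ("hollow word", math.log(0.1), math.log(0.6)),
]
```

With these toy numbers, a small `ctc_weight` lets the seq2seq score dominate and a large one lets the CTC score dominate, so the selected transcript can change with the weight. In practice the weight would be tuned on held-out data.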

Code

lumia-group/leveraging-self-supervi… official

Tasks

Audio-Visual Speech Recognition

Automatic Speech Recognition (ASR)

Language Modelling

Lipreading

Lip Reading

Self-Supervised Learning

Sentence

Speech Recognition

Visual Speech Recognition

Datasets

ImageNet

LibriSpeech

LRW

Libri-Light

LRS2

Results from the Paper

Ranked #2 on Automatic Speech Recognition (ASR) on LRS2
