TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Lipreading	Lip Reading in the Wild	3D Conv + P3D-ResNet50 + TCN	Top-1 Accuracy	84.80	# 12
Audio-Visual Speech Recognition	LRS3-TED	EG-seq2seq	Word Error Rate (WER)	6.8	# 6
Lipreading	LRS3-TED	EG-seq2seq	Word Error Rate (WER)	57.8	# 12

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/discriminative-multi-modality-speech/audio-visual-speech-recognition-on-lrs3-ted)](https://paperswithcode.com/sota/audio-visual-speech-recognition-on-lrs3-ted?p=discriminative-multi-modality-speech)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/discriminative-multi-modality-speech/lipreading-on-lip-reading-in-the-wild)](https://paperswithcode.com/sota/lipreading-on-lip-reading-in-the-wild?p=discriminative-multi-modality-speech)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/discriminative-multi-modality-speech/lipreading-on-lrs3-ted)](https://paperswithcode.com/sota/lipreading-on-lrs3-ted?p=discriminative-multi-modality-speech)`

Discriminative Multi-modality Speech Recognition

CVPR 2020 · Bo Xu, Cheng Lu, Yandong Guo, Jacob Wang

Vision is often used as a complementary modality for audio speech recognition (ASR), especially in noisy environments where the performance of the audio modality alone deteriorates significantly. Combining the visual modality with audio upgrades ASR to multi-modality speech recognition (MSR). In this paper, we propose a two-stage speech recognition model. In the first stage, the target voice is separated from background noise with the help of the corresponding visual information of lip movements, making the model 'listen' clearly. In the second stage, the audio modality is combined with the visual modality again by an MSR sub-network to better understand the speech, further improving the recognition rate. There are some other key contributions: we introduce a pseudo-3D residual convolution (P3D)-based visual front-end to extract more discriminative features; we upgrade the temporal convolution block from a 1D ResNet to a temporal convolutional network (TCN), which is more suitable for temporal tasks; and the MSR sub-network is built on top of the Element-wise-Attention Gated Recurrent Unit (EleAtt-GRU), which is more effective than the Transformer on long sequences. We conducted extensive experiments on the LRS3-TED and LRW datasets. Our two-stage model (audio-enhanced multi-modality speech recognition, AE-MSR) consistently achieves state-of-the-art performance by a significant margin, which demonstrates the necessity and effectiveness of AE-MSR.
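The abstract names a temporal convolutional network (TCN) as the replacement for the 1D ResNet temporal block. As a rough illustration of why such a block suits temporal tasks, the sketch below stacks dilated 1D convolutions and shows how the receptive field grows with depth. It assumes the generic causal, dilated formulation of a TCN layer; the function names, the kernel size of 2, and the dilations 1, 2, 4 are illustrative choices and are not taken from the paper.

```python
from typing import List, Sequence


def dilated_causal_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int
) -> List[float]:
    """Causal 1D convolution with a given dilation (hypothetical helper).

    Output step t combines x[t], x[t - d], x[t - 2d], ... and treats
    positions before the start of the sequence as zeros, so the output
    has the same length as the input.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation
            if idx >= 0:  # implicit left zero-padding
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input steps one output step can see after stacking the layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    # Feed a unit impulse through three layers with dilations 1, 2, 4
    # (assumed values) and watch it spread across the sequence.
    signal = [1.0] + [0.0] * 7
    for d in (1, 2, 4):
        signal = dilated_causal_conv1d(signal, [1.0, 1.0], d)
    print(signal)
    print(receptive_field(2, (1, 2, 4)))
```

With doubling dilations, three layers of kernel size 2 already cover 8 time steps, whereas three undilated layers of the same size would cover only 4.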

Code

JackSyu/Discriminative-Multi-modali… official

JackSyu/AE-MSR

Tasks

Audio-Visual Speech Recognition

Lipreading

speech-recognition

Speech Recognition

Datasets

LRW • LRS3-TED

Results from the Paper

Ranked #6 on Audio-Visual Speech Recognition on LRS3-TED (using extra training data)
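The LRS3-TED results on this page are reported as Word Error Rate (WER): 6.8 for audio-visual speech recognition and 57.8 for lipreading, presumably as percentages. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The function name and the example sentences are illustrative, and this is not the evaluation script used by the authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"{100 * wer:.1f}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.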

Methods

1x1 Convolution • Absolute Position Encodings • Adam • Average Pooling • Batch Normalization • Bottleneck Residual Block • BPE • Convolution • Dense Connections • Dropout • Global Average Pooling • Kaiming Initialization • Label Smoothing • Layer Normalization • Linear Layer • Max Pooling • Multi-Head Attention • Position-Wise Feed-Forward Layer • ReLU • Residual Block • Residual Connection • ResNet • Scaled Dot-Product Attention • Softmax • Transformer
