HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

14 Jun 2021 · Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed

Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
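The two ingredients named in the abstract, offline clustering to produce frame-level targets and a prediction loss applied over masked regions only, can be illustrated with a toy sketch. This is not the authors' implementation: the helper names (`kmeans_labels`, `span_mask`, `masked_cross_entropy`), the random stand-in features and logits, and the span and cluster settings are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import random


def kmeans_labels(frames, k, iters=10, seed=0):
    """Assign each frame (a list of floats) to one of k clusters with plain k-means."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    labels = [0] * len(frames)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, f in enumerate(frames):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def span_mask(num_frames, span, num_spans, seed=0):
    """Mark num_spans contiguous spans (possibly overlapping) as masked."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for _ in range(num_spans):
        start = rng.randrange(0, num_frames - span + 1)
        for t in range(start, start + span):
            mask[t] = True
    return mask


def masked_cross_entropy(logits, targets, mask):
    """Average cross-entropy over masked frames only; unmasked frames are skipped."""
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        peak = max(frame_logits)
        # Numerically stable log-sum-exp for the softmax normaliser.
        log_z = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        total += log_z - frame_logits[target]
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in acoustic features: 50 frames of 4 dimensions (hypothetical data).
    frames = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(50)]
    targets = kmeans_labels(frames, k=5)      # offline "teacher" labels
    mask = span_mask(len(frames), span=10, num_spans=2)
    # Stand-in model outputs; a real model would compute these from masked input.
    logits = [[rng.gauss(0, 1) for _ in range(5)] for _ in frames]
    print(masked_cross_entropy(logits, targets, mask))
```

Because the loss ignores unmasked frames, a model trained this way gets no credit for reproducing labels of frames it can see directly. It has to infer the hidden units from surrounding context, which is the behaviour the abstract attributes to restricting the loss to masked regions.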

Code

huggingface/transformers (official): 124,889 stars
pytorch/fairseq (official): 29,233 stars
PaddlePaddle/PaddleSpeech: 10,131 stars
bshall/hubert: 286 stars
huseinzol05/malaya-speech: 219 stars

See all 8 implementations

Tasks

Clustering • Language Modelling • Representation Learning • Speech Recognition

Datasets

LibriSpeech • Libri-Light

Results from the Paper

Ranked #4 on Speech Recognition on LibriSpeech test-other

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Speech Recognition	LibriSpeech test-clean	HuBERT with Libri-Light	Word Error Rate (WER)	1.8	#9
Speech Recognition	LibriSpeech test-other	HuBERT with Libri-Light	Word Error Rate (WER)	2.9	#4
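For readers unfamiliar with the metric in the table, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below assumes that definition and that the table values are percentages. The function names and the example sentences are hypothetical, and the `relative_reduction` helper only illustrates how a "relative WER reduction" is usually calculated; it is not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent between two whitespace-tokenised transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Relative improvement of new_wer over baseline_wer, in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion in six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Hypothetical WER values, used only to show the arithmetic.
    print(relative_reduction(10.0, 8.0))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, since the denominator counts only reference words.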

Methods

Adam • Attention Dropout • BERT • Dense Connections • Dropout • GELU • Layer Normalization • Linear Layer • Linear Warmup With Linear Decay • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Weight Decay • WordPiece
