Generative Spoken Language Modeling from Raw Audio

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), together with a set of metrics to automatically evaluate the learned representations at the acoustic and linguistic levels, for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and we validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
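The encoder stage turns continuous frame-level speech features into a sequence of discrete pseudo-text units, typically by assigning each frame to its nearest k-means centroid and collapsing consecutive repeats before language modeling. The sketch below illustrates that quantization step only; the random features and centroids stand in for real encoder outputs (CPC, wav2vec 2.0, or HuBERT features) and learned k-means codebooks, which this snippet does not train.

```python
import numpy as np

def quantize(frames, centroids):
    """Map each frame feature vector to the index of its nearest centroid.

    frames:    (T, D) array of per-frame features (stand-in for encoder output)
    centroids: (K, D) array of k-means centroids (K = 50, 100, or 200 units)
    Returns a length-T array of discrete pseudo-text unit ids.
    """
    # Pairwise Euclidean distances (T, K) via broadcasting, then argmin per frame.
    dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def dedup(units):
    """Collapse runs of repeated units (e.g. [5, 5, 5, 2, 2] -> [5, 2])."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

# Toy illustration with 2 centroids in 2-D feature space.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
frames = np.array([[0.1, 0.2], [9.8, 10.1], [10.0, 9.9], [0.0, 0.1]])
units = quantize(frames, centroids)          # -> [0, 1, 1, 0]
pseudo_text = dedup(units.tolist())          # -> [0, 1, 0]
```

The deduplicated unit sequence is what the generative language model is trained on, and what the decoder consumes to synthesize a waveform.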


Results from the Paper

Task         Dataset      Model      Metric  Value  Global Rank
Resynthesis  LibriSpeech  HuBERT-L6  PER     16.68  #1
Resynthesis  LibriSpeech  HuBERT-L6  CER     11.85  #2
Resynthesis  LibriSpeech  HuBERT-L6  MOS      3.49  #2
Resynthesis  LibriSpeech  CPC        PER     14.23  #2
Resynthesis  LibriSpeech  CPC        CER      8.29  #1
Resynthesis  LibriSpeech  CPC        MOS      3.54  #1
Resynthesis  LJSpeech     CPC        PER      8.74  #2
Resynthesis  LJSpeech     CPC        CER      9.20  #1
Resynthesis  LJSpeech     CPC        MOS      3.85  #1
Resynthesis  LJSpeech     HuBERT-L6  PER     11.45  #1
Resynthesis  LJSpeech     HuBERT-L6  CER     11.02  #2
Resynthesis  LJSpeech     HuBERT-L6  MOS      3.69  #2
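PER (phone error rate) and CER (character error rate) are both normalized edit distances between a reference transcription and a recognizer's transcription of the resynthesized audio, while MOS is a human-rated mean opinion score. As a minimal sketch (not the paper's evaluation code), the rate can be computed with a standard dynamic-programming Levenshtein distance:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance for prefixes ref[:0], hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1] from the previous row
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                            # deletion
                dp[j - 1] + 1,                        # insertion
                prev + (ref[i - 1] != hyp[j - 1]),    # substitution / match
            )
            prev = cur
    return dp[n]

def error_rate(ref, hyp):
    """CER when ref/hyp are character strings, PER when they are phone lists."""
    return edit_distance(ref, hyp) / len(ref)

# e.g. edit_distance("kitten", "sitting") == 3
```

A reported CER of 11.85 thus means the recognized characters differ from the reference by roughly 11.85 edits per 100 reference characters.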

