Speech Recognition

wav2vec Unsupervised

Introduced by Baevski et al. in Unsupervised Speech Recognition

wav2vec-U is an unsupervised method for training speech recognition models without any labeled data. It leverages self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
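The adversarial part can be illustrated with the two losses that such training alternates between. This is a minimal NumPy sketch under stated assumptions. The generator is a hypothetical linear map with a softmax over phonemes, and the discriminator is a hypothetical linear scorer averaged over time. The losses are the standard non-saturating GAN objective. The paper's actual networks and its full objective, which may include extra regularization terms, are not reproduced here.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def generator(segments, W):
    """Hypothetical toy generator: segment features (T x d) -> phoneme distributions (T x P)."""
    return softmax(segments @ W)

def discriminator(phonemes, v):
    """Hypothetical toy discriminator: probability that a (T x P) phoneme sequence is real text."""
    return float(sigmoid(np.mean(phonemes @ v)))

def bce(p, target, eps=1e-7):
    """Binary cross-entropy of a single probability against a 0/1 target."""
    p = min(max(p, eps), 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p))

def adversarial_losses(segments, real_phonemes, W, v):
    """Return (discriminator_loss, generator_loss) for one unpaired audio/text example."""
    fake = generator(segments, W)
    d_fake = discriminator(fake, v)
    d_real = discriminator(real_phonemes, v)
    # Discriminator: label phonemized text as real (1) and generator output as fake (0).
    d_loss = bce(d_real, 1.0) + bce(d_fake, 0.0)
    # Generator (non-saturating form): make its output look real to the discriminator.
    g_loss = bce(d_fake, 1.0)
    return d_loss, g_loss

# Usage: audio segments and text are unpaired and may differ in length.
rng = np.random.default_rng(0)
T, d, P = 6, 8, 5
segments = rng.normal(size=(T, d))        # segment features from the audio side
real_ids = rng.integers(0, P, size=4)     # an unrelated phonemized text sequence
real_phonemes = np.eye(P)[real_ids]       # one-hot encoding, shape (4, P)
W = rng.normal(size=(d, P))
v = rng.normal(size=P)
d_loss, g_loss = adversarial_losses(segments, real_phonemes, W, v)
```

In real training, the two losses would be minimized in alternation with respect to the discriminator's and the generator's parameters. The sketch only shows how each loss is computed.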

Specifically, we learn self-supervised representations with wav2vec 2.0 on unlabeled speech audio, then identify clusters in the representations with k-means to segment the audio data. Next, we build segment representations by mean pooling the wav2vec 2.0 representations within each segment, performing PCA, and applying a second mean-pooling step between adjacent segments. These segment representations are input to the generator, which outputs a phoneme sequence. That sequence is fed to the discriminator alongside phonemized unlabeled text to perform adversarial training.
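The segment-building steps can be sketched in NumPy under several assumptions. First, each frame already has a k-means cluster ID. Second, a segment is a run of consecutive frames that share a cluster ID. Third, the second pooling step averages each pair of neighbouring segments. The sketch fits PCA on the frame features and projects before pooling. Both steps are linear, so for a fixed projection, projecting before or after the first mean pooling gives the same segment vectors. All function names are hypothetical, and the reference implementation may order or parameterize these steps differently.

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X (T x d) onto their top-k principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def segment_by_cluster(cluster_ids):
    """Split frame indices into runs of consecutive frames sharing a cluster ID."""
    bounds = np.flatnonzero(np.diff(cluster_ids)) + 1
    return np.split(np.arange(len(cluster_ids)), bounds)

def mean_pool(X, groups):
    """Average the rows of X within each group of frame indices."""
    return np.stack([X[g].mean(axis=0) for g in groups])

def merge_adjacent(S):
    """Second pooling step: average each pair of neighbouring segments (an odd last one is kept)."""
    return np.stack([S[i:i + 2].mean(axis=0) for i in range(0, len(S), 2)])

def build_segment_features(frames, cluster_ids, k):
    reduced = pca_project(frames, k)
    segments = mean_pool(reduced, segment_by_cluster(cluster_ids))
    return merge_adjacent(segments)

# Usage with random stand-ins for wav2vec 2.0 frame features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 16))
cluster_ids = np.array([0, 0, 1, 1, 1, 2, 0, 0, 3, 3])
feats = build_segment_features(frames, cluster_ids, k=4)
```

Here the ten frames form five runs of lengths 2, 3, 1, 2 and 2. Pairwise merging reduces these to three segment vectors of dimension 4.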

Source: Unsupervised Speech Recognition

Components

Component            Type
k-Means Clustering   Clustering
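In wav2vec-U, k-means assigns each wav2vec 2.0 frame a cluster ID, and those IDs are used to segment the audio. The sketch below is a minimal NumPy version of Lloyd's algorithm, the standard way to fit k-means. The random initialization, iteration cap, empty-cluster handling and convergence check are illustrative choices, not settings taken from the paper.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each row of X."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: return (centroids, labels) for the rows of X."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Move each centroid to the mean of its points; keep it in place if its cluster is empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new, centroids)
        centroids = new
        if converged:
            break
    return centroids, assign(X, centroids)

# Usage: two well-separated 2-D blobs stand in for frame features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)), rng.normal(5.0, 0.1, size=(20, 2))])
centroids, labels = kmeans(X, k=2)
```

The returned `labels` are per-frame cluster IDs. A segmentation step can use them to locate the points where the ID changes.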