Speech Representation Learning
46 papers with code • 0 benchmarks • 0 datasets
Most implemented papers
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation.
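Below is a minimal PyTorch sketch of the masked-prediction idea the abstract describes: frames are assigned offline cluster ids (e.g. k-means over acoustic features), a subset of frames is masked, and the model classifies the masked positions into those cluster ids. The class name, dimensions, depth, and cluster count are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MaskedClusterPredictor(nn.Module):
    """Hedged sketch of HuBERT-style masked prediction: masked frames are
    classified into offline cluster ids (e.g. k-means over acoustic
    features). All sizes here are illustrative."""
    def __init__(self, feat_dim=39, d_model=256, num_clusters=100):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.mask_emb = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, num_clusters)

    def forward(self, feats, cluster_ids, mask):
        # feats: (B, T, feat_dim); cluster_ids: (B, T) k-means pseudo-labels;
        # mask: (B, T) bool marking frames hidden from the model
        x = self.proj(feats)
        x[mask] = self.mask_emb                 # learned mask embedding
        logits = self.head(self.encoder(x))
        # prediction loss is applied over the masked regions only
        return nn.functional.cross_entropy(logits[mask], cluster_ids[mask])
```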
Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders
We present Mockingjay as a new speech representation learning approach, where bidirectional Transformer encoders are pre-trained on a large amount of unlabeled speech.
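A hedged sketch of the pre-training objective this describes, assuming the usual masked-acoustic-model reading: mask a subset of mel-spectrogram frames, encode with a bidirectional Transformer, and reconstruct the hidden frames. Sizes and the L1 loss placement are illustrative.

```python
import torch
import torch.nn as nn

class MaskedAcousticModel(nn.Module):
    """Sketch of Mockingjay-style pre-training: a bidirectional Transformer
    reconstructs mel-spectrogram frames that were masked out. Dimensions
    are illustrative, not the paper's configuration."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.recon = nn.Linear(d_model, n_mels)

    def forward(self, mels, mask):
        # mels: (B, T, n_mels); mask: (B, T) bool marking frames to hide
        x = mels.masked_fill(mask.unsqueeze(-1), 0.0)
        pred = self.recon(self.encoder(self.proj(x)))
        # reconstruction loss on the masked frames only
        return nn.functional.l1_loss(pred[mask], mels[mask])
```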
Unsupervised speech representation learning using WaveNet autoencoders
We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms.
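The paper's focus is the bottleneck between the encoder and the WaveNet decoder; below is a hedged sketch of one variant it studies, a VQ-VAE-style discrete bottleneck with straight-through gradients. Codebook size and dimension are illustrative.

```python
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    """Sketch of a VQ-VAE-style bottleneck: each continuous encoder frame is
    snapped to its nearest codebook vector, and a straight-through estimator
    lets gradients reach the encoder. num_codes/dim are illustrative."""
    def __init__(self, num_codes=320, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (B, T, dim) continuous encoder outputs
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        codes = dists.argmin(dim=-1)            # nearest code id per frame
        q = self.codebook(codes)                # quantized frames
        q = z + (q - z).detach()                # straight-through gradients
        return q, codes
```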
UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data
In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner.
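A hedged sketch of the multi-task mixing the abstract describes: a supervised phonetic CTC term plus a contrastive self-supervised term, combined with a weight. The contrastive term below is a simplified in-utterance version, and the function name, shapes, and alpha are illustrative.

```python
import torch
import torch.nn.functional as F

def unispeech_style_loss(log_probs, targets, input_lens, target_lens,
                         context, quantized, alpha=0.5):
    """Sketch of a UniSpeech-style multi-task objective: supervised phonetic
    CTC on labeled data plus a contrastive term over quantized
    representations. All names and alpha are illustrative."""
    # log_probs: (T, B, vocab) log-softmax outputs for the labeled batch
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens)
    # context, quantized: (B, T, D); each context frame should match its own
    # quantized frame, with the utterance's other frames as distractors
    sim = torch.einsum('btd,bsd->bts', context, quantized)
    labels = torch.arange(sim.size(1), device=sim.device).repeat(sim.size(0))
    contrastive = F.cross_entropy(sim.flatten(0, 1), labels)
    return alpha * ctc + (1 - alpha) * contrastive
```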
UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training
We integrate the proposed speaker-aware pre-training methods into the HuBERT framework.
An Unsupervised Autoregressive Model for Speech Representation Learning
This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations.
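A minimal sketch of the autoregressive objective this describes (as in APC): a unidirectional RNN reads past frames and predicts the frame n steps ahead, and the hidden states serve as the representation. Layer sizes, the GRU choice, and n are illustrative.

```python
import torch
import torch.nn as nn

class AutoregressivePredictor(nn.Module):
    """Sketch of an APC-style objective: a unidirectional RNN predicts the
    frame n_steps ahead with an L1 loss; sizes are illustrative."""
    def __init__(self, n_mels=80, hidden=512, n_steps=3):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)
        self.n_steps = n_steps

    def forward(self, mels):
        # mels: (B, T, n_mels); hidden state at t predicts frame t + n_steps
        h, _ = self.rnn(mels[:, :-self.n_steps])
        return nn.functional.l1_loss(self.out(h), mels[:, self.n_steps:])
```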
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
In particular, when compared to published models such as Conformer-based wav2vec 2.0 and HuBERT, our model shows a 5% to 10% relative WER reduction on the test-clean and test-other subsets.
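The method itself pairs a wav2vec 2.0-style contrastive module with a BERT-style masked-prediction module trained end to end; a hedged sketch of how the two terms might be combined (the weight and all argument names are illustrative):

```python
import torch.nn.functional as F

def w2v_bert_style_loss(contrastive_logits, contrastive_targets,
                        mlm_logits, token_ids, mask, beta=1.0):
    """Sketch of a combined objective in the spirit of w2v-BERT: a
    contrastive term over quantized targets plus a masked-prediction
    cross-entropy over the quantizer's discrete ids."""
    l_contrastive = F.cross_entropy(contrastive_logits, contrastive_targets)
    l_mlm = F.cross_entropy(mlm_logits[mask], token_ids[mask])
    return l_contrastive + beta * l_mlm
```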
Sampling strategies in Siamese Networks for unsupervised speech representation learning
We apply these results to pairs of words discovered using an unsupervised algorithm and show an improvement over the state of the art in unsupervised representation learning using Siamese networks.
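A minimal sketch of the Siamese setup the sampling strategies feed, assuming a cosine-similarity loss over same/different word pairs; the architecture, margin, and dimensions are illustrative, since the paper's contribution is how pairs are sampled rather than the network itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEmbedder(nn.Module):
    """Sketch of a Siamese embedder over word pairs: one shared encoder
    embeds both tokens; a cosine-based loss pulls same-word pairs together
    and pushes different-word pairs apart. Sizes are illustrative."""
    def __init__(self, feat_dim=40, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, a, b, same, margin=0.5):
        # a, b: (B, feat_dim) fixed-length word features; same: (B,) bool
        cos = F.cosine_similarity(self.net(a), self.net(b))
        pos = (1 - cos)[same].sum()             # same word: push cos toward 1
        neg = F.relu(cos - margin)[~same].sum() # different: keep cos <= margin
        return (pos + neg) / a.size(0)
```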
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English.
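The released XLS-R checkpoints can be used directly as multilingual feature extractors; a usage sketch via Hugging Face Transformers (the checkpoint id below is the published 300M-parameter model, and the random waveform stands in for real 16 kHz mono audio):

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")

waveform = torch.randn(16000)  # stand-in for one second of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    reps = model(**inputs).last_hidden_state  # (1, frames, 1024) for this checkpoint
print(reps.shape)
```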
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training.