Specifically, we compare discrete and soft speech units as input features.
In this paper, we first show that the per-utterance mean of CPC features captures speaker information to a large extent.
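If the per-utterance mean of the features carries speaker information, a simple way to probe or reduce it is mean removal. The following is a minimal sketch, not the paper's method; the function name and shapes are illustrative:

```python
import numpy as np

def remove_speaker_mean(features):
    """Subtract the per-utterance mean from a (T, D) feature matrix.

    Illustrative only: if the utterance mean captures speaker identity,
    mean removal is a cheap first step toward speaker normalisation.
    """
    return features - features.mean(axis=0, keepdims=True)

# Toy example: a constant offset mimics a speaker-dependent shift.
utt = np.random.randn(100, 256) + 5.0
normed = remove_speaker_mean(utt)
```

After normalisation, each feature dimension has zero mean over the utterance, so any speaker cue encoded purely in the mean is gone.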
We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code, thereby giving a variable-rate segmentation of the speech into discrete units.
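Once contiguous frames share a code, the segmentation falls out of a run-length pass over the code sequence. A minimal sketch (function name and tuple layout are illustrative, not the paper's implementation):

```python
def codes_to_segments(codes):
    """Collapse runs of identical codes into (code, start, end) segments.

    Illustrates how assigning blocks of contiguous frames to the same
    code yields a variable-rate segmentation of the utterance.
    """
    segments = []
    start = 0
    for i in range(1, len(codes) + 1):
        # Close the current run at the end of the sequence or on a code change.
        if i == len(codes) or codes[i] != codes[start]:
            segments.append((codes[start], start, i))
            start = i
    return segments

print(codes_to_segments([3, 3, 3, 7, 7, 1]))  # → [(3, 0, 3), (7, 3, 5), (1, 5, 6)]
```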
The idea is to learn a representation of speech by predicting future acoustic units.
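The predictive objective can be illustrated with a toy stand-in: a real system predicts future units with a neural network, but the same idea shows up in a simple bigram counter over discovered unit sequences. All names here are illustrative assumptions:

```python
from collections import Counter, defaultdict

def fit_next_unit_model(sequences):
    """Count-based predictor of the next acoustic unit given the current one.

    A toy analogue of learning representations by predicting future units:
    for each unit, record which unit most often follows it.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    # Map each unit to its most frequent successor.
    return {u: c.most_common(1)[0][0] for u, c in counts.items()}

model = fit_next_unit_model([[1, 2, 3, 1, 2, 3], [1, 2, 2, 3]])
print(model[1])  # → 2
```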
This approach ranked #1 on Acoustic Unit Discovery on ZeroSpeech 2019 (English).
The environment's dynamics are learned from limited training data and can be reused in new task instances without retraining.
Our results therefore suggest that, in the shallow-to-moderate depth setting, critical initialisation provides no performance gain over off-critical initialisations, and that searching for off-critical initialisations that might improve training speed or generalisation is likely to be a fruitless endeavour.
16 Apr 2019 • Ryan Eloff, André Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan van Biljon, Ewald van der Westhuizen, Lisa van Staden, Herman Kamper
For our submission to the ZeroSpeech 2019 challenge, we apply discrete latent-variable neural networks to unlabelled speech and use the discovered units for speech synthesis.
An important property for lifelong-learning agents is the ability to combine existing skills to solve unseen tasks.