We formalise this task and call it visually prompted keyword localisation (VPKL): given an image depicting a keyword, detect whether the keyword occurs in an utterance and predict where it occurs.
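As a rough picture of what detection and localisation involve, the sketch below scores an utterance against an image query by comparing per-frame speech embeddings with the image embedding in a shared space: the maximum similarity acts as a detection score and its position as the predicted location. The function name and the assumption that both modalities are already embedded in a common space are illustrative, not the paper's exact formulation.

```python
import numpy as np

def vpkl_score(image_emb, speech_frame_embs):
    """Score an utterance against an image query (illustrative sketch).

    image_emb:         (d,) embedding of the query image.
    speech_frame_embs: (T, d) per-frame embeddings of the utterance,
                       assumed to live in the same space as the image.
    Returns a detection score and the frame where the keyword most
    likely occurs.
    """
    # Cosine similarity between the image query and every speech frame.
    img = image_emb / np.linalg.norm(image_emb)
    frames = speech_frame_embs / np.linalg.norm(
        speech_frame_embs, axis=1, keepdims=True)
    sims = frames @ img                      # (T,)

    detection_score = sims.max()             # is the keyword present?
    predicted_frame = int(sims.argmax())     # where does it occur?
    return detection_score, predicted_frame
```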
In this paper, we first show that the per-utterance mean of CPC features captures speaker information to a large extent.
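One common way to check such a claim is a simple probing classifier: mean-pool the CPC features of each utterance and see how well a linear model predicts the speaker. The sketch below follows that recipe; the function name and the use of scikit-learn are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def speaker_probe_accuracy(train_feats, train_speakers, test_feats, test_speakers):
    """Probe how much speaker information per-utterance mean features carry.

    *_feats:    lists of (T_i, d) arrays of CPC features, one per utterance.
    *_speakers: speaker labels, one per utterance.
    """
    # Mean-pool each utterance into a single (d,) vector.
    X_train = np.stack([f.mean(axis=0) for f in train_feats])
    X_test = np.stack([f.mean(axis=0) for f in test_feats])

    # High accuracy of a linear classifier indicates the mean is
    # strongly speaker-discriminative.
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_speakers)
    return clf.score(X_test, test_speakers)
```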
We propose direct multimodal few-shot models that learn a shared embedding space of spoken words and images from only a few paired examples.
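To make "a shared embedding space learned from paired examples" concrete, here is a minimal contrastive-training sketch in PyTorch: matched speech-image pairs are pulled together and mismatched pairs pushed apart. The symmetric InfoNCE-style loss and all names are assumptions for illustration rather than the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(speech_embs, image_embs, temperature=0.1):
    """Align spoken-word and image embeddings from paired examples (sketch).

    speech_embs, image_embs: (B, d) embeddings of B matched word-image
    pairs, produced by separate speech and vision encoders (not shown).
    """
    s = F.normalize(speech_embs, dim=1)
    v = F.normalize(image_embs, dim=1)
    logits = s @ v.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(s.size(0))         # i-th clip matches i-th image
    # Symmetric loss over both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```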
The idea is to learn a representation of speech by predicting future acoustic units.
Ranked #1 on Acoustic Unit Discovery on ZeroSpeech 2019 English
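The future-prediction idea above can be sketched as an InfoNCE-style loss in which a context vector must pick out the true future frame among sampled negatives. The argument names and single-step setup below are simplifying assumptions, not the exact training objective of the submission.

```python
import torch
import torch.nn.functional as F

def future_prediction_loss(context, future_frames, negatives, predictor):
    """Contrastive future-prediction loss in the spirit of CPC (sketch).

    context:       (B, d) summary of the past from an autoregressive model.
    future_frames: (B, d) encoded frames k steps ahead (the positives).
    negatives:     (B, N, d) encoded frames from other times or utterances.
    predictor:     module mapping the context to a predicted future frame.
    """
    pred = predictor(context)                               # (B, d)
    pos = (pred * future_frames).sum(dim=1, keepdim=True)   # (B, 1)
    neg = torch.einsum('bd,bnd->bn', pred, negatives)       # (B, N)
    logits = torch.cat([pos, neg], dim=1)
    # The model must identify the true future frame (index 0) among negatives.
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)
```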
no code implementations • 16 Apr 2019 • Ryan Eloff, André Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan van Biljon, Ewald van der Westhuizen, Lisa van Staden, Herman Kamper
For our submission to the ZeroSpeech 2019 challenge, we apply discrete latent-variable neural networks to unlabelled speech and use the discovered units for speech synthesis.
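As a rough picture of the discrete latent-variable step, the sketch below shows VQ-style quantisation: each encoder frame is assigned to its nearest codebook entry, and a straight-through estimator lets gradients reach the encoder. The function and argument names are illustrative assumptions; the submission's actual model may differ.

```python
import torch

def vector_quantise(z, codebook):
    """Map continuous speech features to discrete units (illustrative sketch).

    z:        (B, T, d) encoder outputs for a batch of utterances.
    codebook: (K, d) learnable embedding vectors (the discrete units).
    Returns quantised features and the chosen unit index per frame.
    """
    # Distance between every frame and every codebook entry.
    dists = torch.cdist(z.reshape(-1, z.size(-1)), codebook)   # (B*T, K)
    indices = dists.argmin(dim=1)                              # unit per frame
    quantised = codebook[indices].view_as(z)
    # Straight-through: forward pass uses codebook vectors, backward
    # pass copies gradients to the encoder output.
    quantised = z + (quantised - z).detach()
    return quantised, indices.view(z.shape[0], z.shape[1])
```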