This paper focuses on loss functions that embed the error between two poses for deep-learning-based camera pose regression.
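As a rough illustration of what a pose-error loss can look like, the sketch below combines a Euclidean translation error with a geodesic rotation error between unit quaternions. The function names, the `(w, x, y, z)` quaternion convention, and the weighting factor `beta` are assumptions made for this example; this is one common formulation and not necessarily any of the losses studied in the paper.

```python
import math


def normalize(q):
    """Return q scaled to unit length (q is a 4-tuple in w, x, y, z order)."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("zero-length quaternion")
    return tuple(c / norm for c in q)


def translation_error(t_pred, t_true):
    """Euclidean distance between two 3D positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_pred, t_true)))


def rotation_error(q_pred, q_true):
    """Geodesic angle (radians) between two orientations given as quaternions.

    The absolute value of the dot product makes q and -q equivalent,
    since both represent the same rotation.
    """
    qp, qt = normalize(q_pred), normalize(q_true)
    dot = abs(sum(a * b for a, b in zip(qp, qt)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return 2.0 * math.acos(dot)


def pose_loss(t_pred, q_pred, t_true, q_true, beta=1.0):
    """Weighted sum of translation and rotation errors (beta is hypothetical)."""
    return translation_error(t_pred, t_true) + beta * rotation_error(q_pred, q_true)
```

The weight `beta` balances quantities with different units (length versus angle), so a suitable value depends on the scale of the scene and has to be chosen or learned per dataset.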
We investigate how several self-supervised learning (SSL) methods based on Contrastive Predictive Coding (CPC) perform on phoneme categorization and on phoneme and word segmentation.
We present a number of low-resource approaches to the tasks of the Zero Resource Speech Challenge 2021.
We investigate the possibility of forcing a self-supervised model trained using a contrastive predictive loss to extract slowly varying latent representations.
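For readers unfamiliar with contrastive predictive losses, the sketch below shows a generic InfoNCE-style loss for a single prediction: the model's predicted latent is scored against one true future latent and several distractors, and the loss is the negative log-probability of the true one under a softmax over the scores. The dot-product scoring and the function name `info_nce` are assumptions for illustration; this is the generic loss only and does not include any mechanism for encouraging slowly varying representations.

```python
import math


def info_nce(prediction, candidates, positive_index):
    """InfoNCE-style loss for one prediction.

    prediction:     list of floats, the predicted latent vector
    candidates:     list of latent vectors (one positive, the rest negatives)
    positive_index: position of the true (positive) latent in `candidates`
    """
    # Dot-product similarity between the prediction and every candidate.
    scores = [sum(p * c for p, c in zip(prediction, cand)) for cand in candidates]
    # Numerically stable log-sum-exp over all candidate scores.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    # Negative log-softmax probability assigned to the positive candidate.
    return log_norm - scores[positive_index]
```

When every candidate receives the same score, the loss equals the logarithm of the number of candidates, which is the chance-level baseline against which a trained model can be compared.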
Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech.
We show that codebook learning can suffer from poor initialization and from non-stationarity of the clustered encoder outputs.
Deep learning systems have achieved high accuracy in image classification, but at the cost of requiring large image datasets.
The process of selective attention in the brain is known to contextually exploit the available audio and visual cues to better focus on the target speaker while filtering out other noise.
A system is presented that segments, clusters and predicts musical audio in an unsupervised manner, adjusting the number of (timbre) clusters instantaneously to the audio input.