Acoustic/prosodic (a/p) entrainment has been associated with multiple positive social aspects of human-human conversations.
In this paper we describe our data augmentation approach to improve the results of ASR models for low-resource and agglutinative languages.
Jitter and shimmer measurements have shown to be carriers of voice quality and prosodic information which enhance the performance of tasks like speaker recognition, diarization or automatic speech recognition (ASR).
Furthermore, we study the influence of a language model -- which tends to correct non-standard sequences of words -- with the lack of language model to decode the hypothesis from the ASR.
Nowadays, research in speech technologies has gotten a lot out thanks to recently created public domain corpora that contain thousands of recording hours.
This work explores the application of Lambda networks, an alternative framework for capturing long-range interactions without attention, for the keyword spotting task.
This paper describes joint effort of BUT and Telef\'onica Research on development of Automatic Speech Recognition systems for Albayzin 2020 Challenge.
Keyword spotting and in particular Wake-Up-Word (WUW) detection is a very important task for voice assistants.
Up to our knowledge, this is the first work combining such pitch and voice quality features with modern convolutional architectures, showing improvements up to 7% and 3% relative WER points, for the publicly available Spanish Common Voice and LibriSpeech 100h datasets, respectively.
In this work, we propose an effective approach for training unique embedding representations by combining three simultaneous modalities: image and spoken and textual narratives.
The use of photoplethysmogram signal (PPG) for heart and sleep monitoring is commonly found nowadays in smartphones and wrist wearables.
Likelihood-based generative models are a promising resource to detect out-of-distribution (OOD) inputs which could compromise the robustness or reliability of a machine learning system.
Ranked #10 on Anomaly Detection on Unlabeled CIFAR-10 vs CIFAR-100
Sounds are an important source of information on our daily interactions with objects.
This means that inferences of statistical patterns of language in acoustics are biased by the arbitrary, language-dependent segmentation of the signal, and virtually precludes the possibility of making comparative studies between human voice and other animal communication systems.
We first show that during speech the energy is unevenly released and power-law distributed, reporting a universal robust Gutenberg-Richter-like law in speech.