no code implementations • Tarek Sakakini, Jong Yoon Lee, Aditya Duri, Renato F.L. Azevedo, Victor Sadauskas, Kuangxiao Gu, Suma Bhat, Dan Morrow, James Graumlich, Saqib Walayat, Mark Hasegawa-Johnson, Thomas Huang, Ann Willemsen-Dunlap, Donald Halpin
We also show that our system achieves higher accuracy than directly supervised neural methods in this low-resource setting.
Phonemes are defined by their relationship to words: changing a phoneme changes the word.
Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks.
Designing equivariance into deep nets as an inductive bias has been a prominent approach to building effective models; e.g., a convolutional neural network incorporates translation equivariance.
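As a concrete illustration of translation equivariance, here is a minimal PyTorch check (not drawn from any of the papers above; channel counts, signal length, and the interior-slice comparison are arbitrary choices): shifting a 1-D signal and then convolving agrees with convolving and then shifting, away from the zero-padded boundary.

```python
import torch
import torch.nn as nn

# Minimal check of translation equivariance for a 1-D convolution: shifting the
# input and then convolving matches convolving and then shifting, except near
# the zero-padded edges where exact equivariance breaks down.
conv = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 32)                        # (batch, channels, length)
shift = 5
x_shifted = torch.roll(x, shifts=shift, dims=-1)

shift_then_conv = conv(x_shifted)
conv_then_shift = torch.roll(conv(x), shifts=shift, dims=-1)

# Compare interior positions only, trimming frames affected by padding / wrap-around.
print(torch.allclose(shift_then_conv[..., shift + 2:-2],
                     conv_then_shift[..., shift + 2:-2], atol=1e-5))
```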
We demonstrate that our high-quality visualizations capture major types of family vocalization interactions, in categories indicative of mental, behavioral, and developmental health, for both labeled and unlabeled LB audio.
An unsupervised text-to-speech synthesis (TTS) system learns to generate the speech waveform corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech.
We show that WavPrompt is a few-shot learner that can perform speech understanding tasks better than a naive text baseline.
SpeechSplit can perform aspect-specific voice conversion by disentangling speech into content, rhythm, pitch, and timbre using multiple autoencoders in an unsupervised manner.
In this paper, we 1) investigate the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language; 2) provide an analysis of which phones transfer well across languages and which do not, in order to understand the limitations of and areas for further improvement in automatic phone inventory creation; and 3) present different methods to build a phone inventory of an unseen language in an unsupervised way.
This paper defines fair principal component analysis (PCA) as minimizing the maximum mean discrepancy (MMD) between dimensionality-reduced conditional distributions of different protected classes.
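For intuition about the quantity being minimized, here is a small NumPy sketch of the (biased) empirical squared MMD with an RBF kernel between the projections of two protected groups. The data, bandwidth, and projection below are placeholders, not the paper's optimization procedure.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """(Biased) empirical squared MMD between samples X and Y with an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

# Hypothetical data: two protected groups projected by a candidate orthonormal
# basis W (d x r).  A fair-PCA-style objective would trade reconstruction
# quality against this discrepancy between the projected group distributions.
rng = np.random.default_rng(0)
X_a = rng.normal(size=(100, 10))
X_b = rng.normal(loc=0.3, size=(120, 10))
W, _ = np.linalg.qr(rng.normal(size=(10, 2)))    # random rank-2 projection
print(rbf_mmd2(X_a @ W, X_b @ W))
```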
In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions.
Three sub-tasks are proposed: (1) speech-to-text based, (2) end-to-end, without speech-to-text as an intermediate component, and (3) cross-lingual, in which the question is spoken in a language different from that in which the KG is recorded.
Therefore, being able to reason over incomplete KGs for QA is a critical requirement in real-world applications that has not been addressed extensively in the literature.
Beyond the model, we also propose a metric for evaluating source separation with a variable number of speakers.
Finally, we achieve a higher level of interpretability by imposing OCCAM on the objects represented in the induced symbolic concept space.
This paper proposes a new model, referred to as the show and speak (SAS) model, which, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes.
Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and retaining only the target language's phonotactic data in LM training is preferable.
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
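One common way to build such a surrogate is to replace the hard true-positive, false-positive, and false-negative counts with their expected values under the predicted probabilities; the sketch below is a hedged PyTorch illustration of that idea, not necessarily the exact approximation used in the paper.

```python
import torch

def soft_f1_loss(probs, targets, eps=1e-8):
    """Differentiable surrogate for the F-measure (here F1).

    probs   -- predicted probabilities in [0, 1]
    targets -- binary ground-truth labels of the same shape
    Hard counts are replaced by their expected values under the predicted
    probabilities, so the expression is differentiable and can be minimized
    with standard backpropagation.
    """
    tp = (probs * targets).sum()
    fp = (probs * (1 - targets)).sum()
    fn = ((1 - probs) * targets).sum()
    return 1.0 - 2.0 * tp / (2.0 * tp + fp + fn + eps)   # minimize 1 - soft-F1

logits = torch.randn(32, requires_grad=True)
targets = torch.randint(0, 2, (32,)).float()
soft_f1_loss(torch.sigmoid(logits), targets).backward()  # gradients flow to the logits
```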
To this end, an Image2Speech system was first implemented that generates image captions consisting of phoneme sequences.
In scenarios where multiple speakers talk at the same time, it is important to be able to identify the talkers accurately.
Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies.
In this article, we provide a model to estimate a real-valued measure of the intelligibility of individual speech segments.
Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.
Recently, AutoVC, a conditional autoencoder (CAE)-based method, achieved state-of-the-art results by disentangling speaker identity and speech content using information-constraining bottlenecks; it achieves zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice.
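A toy sketch of that bottleneck-swap idea (hypothetical layer types and sizes, not the actual AutoVC architecture): the content representation is squeezed through a narrow bottleneck so that speaker information is forced out, and conversion decodes that content with a different speaker's embedding.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions and layers, for illustration only.
n_mels, spk_dim, bottleneck = 80, 64, 8
content_enc = nn.GRU(n_mels + spk_dim, bottleneck, batch_first=True)
decoder     = nn.GRU(bottleneck + spk_dim, n_mels, batch_first=True)

def convert(mel, emb_src, emb_tgt):
    """mel: (B, T, n_mels); emb_src / emb_tgt: (B, spk_dim) speaker embeddings."""
    tile = lambda e: e.unsqueeze(1).expand(-1, mel.size(1), -1)
    content, _ = content_enc(torch.cat([mel, tile(emb_src)], dim=-1))
    converted, _ = decoder(torch.cat([content, tile(emb_tgt)], dim=-1))  # swap speakers
    return converted

out = convert(torch.randn(1, 100, n_mels), torch.randn(1, spk_dim), torch.randn(1, spk_dim))
```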
Convolutional neural networks (CNNs) for time series data implicitly assume that the data are uniformly sampled, whereas many event-based and multi-modal data are nonuniformly sampled or have heterogeneous sampling rates.
We present software that, in only a few hours, transcribes forty hours of recorded speech in a surprise language, using only a few tens of megabytes of noisy text in that language and a zero-resource grapheme-to-phoneme (G2P) table.
On the other hand, CVAE training is simple but does not come with the distribution-matching property of a GAN.
In this paper, the convergence properties of CTC are improved by incorporating acoustic landmarks.
Furui first demonstrated that the identities of both consonants and vowels can be perceived from the C-V transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception, and that steady-state regions are secondary or supplemental.
no code implementations • 16 Feb 2018 • Lucas Ondel, Pierre Godard, Laurent Besacier, Elin Larsen, Mark Hasegawa-Johnson, Odette Scharenborg, Emmanuel Dupoux, Lukas Burget, François Yvon, Sanjeev Khudanpur
Developing speech technologies for low-resource languages has become a very active research field over the last decade.
On the other hand, deep-learning-based enhancement approaches are able to learn complicated speech distributions and perform efficient inference, but they are unable to deal with a variable number of input channels.
no code implementations • 14 Feb 2018 • Odette Scharenborg, Laurent Besacier, Alan Black, Mark Hasegawa-Johnson, Florian Metze, Graham Neubig, Sebastian Stueker, Pierre Godard, Markus Mueller, Lucas Ondel, Shruti Palaskar, Philip Arthur, Francesco Ciannella, Mingxing Du, Elin Larsen, Danny Merkx, Rachid Riad, Liming Wang, Emmanuel Dupoux
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography.
The performance of automatic speech recognition systems degrades with increasing mismatch between the training and testing scenarios.
To provide a theory-based quantification of the architecture's advantages, we introduce a memory capacity measure, the mean recurrent length, which is more suitable for RNNs with long skip connections than existing measures.
Mismatched transcriptions have been proposed as a means to acquire probabilistic transcriptions from non-native speakers of a language. Prior work has demonstrated the value of these transcriptions by successfully adapting cross-lingual ASR systems for different target languages.
We evaluate our techniques using mismatched transcriptions for Cantonese speech acquired from native English and Mandarin speakers.
Three consonant voicing classifiers were developed: (1) manually selected acoustic features anchored at a phonetic landmark, (2) MFCCs (either averaged across the segment or anchored at the landmark), and (3) acoustic features computed using a convolutional neural network (CNN).
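For concreteness, a small librosa sketch of the inputs to classifier (2); the file name, segment boundaries, and landmark frame below are hypothetical placeholders.

```python
import librosa

# Hedged illustration: MFCCs either averaged across a segment or anchored at a landmark frame.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)  # ~10 ms frames

segment = mfcc[:, 20:40]              # frames spanning the (hypothetical) consonant segment
feat_averaged = segment.mean(axis=1)  # MFCCs averaged across the segment
feat_landmark = mfcc[:, 30]           # MFCCs anchored at the (hypothetical) landmark frame
```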
In this paper, we propose a novel method for semantic image inpainting, which generates the missing content by conditioning on the available data.
In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including monaural speech separation, monaural singing voice separation, and speech denoising.
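A minimal PyTorch sketch of what joint optimization of the masking function and the network looks like (hypothetical sizes, not the paper's exact architecture): the mask normalization is just another differentiable layer between the recurrent network and the loss, so masks and network weights are trained together.

```python
import torch
import torch.nn as nn

class MaskingSeparator(nn.Module):
    def __init__(self, n_freq=513, hidden=256, n_src=2):
        super().__init__()
        self.n_src = n_src
        self.rnn = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_src * n_freq)

    def forward(self, mix_mag):                       # (B, T, F) mixture magnitudes
        h, _ = self.rnn(mix_mag)
        est = self.proj(h).abs()
        est = est.reshape(h.size(0), h.size(1), self.n_src, -1)   # (B, T, n_src, F)
        masks = est / (est.sum(dim=2, keepdim=True) + 1e-8)       # soft masks sum to 1 per bin
        return masks * mix_mag.unsqueeze(2)                       # masked source estimates

sources = MaskingSeparator()(torch.rand(4, 100, 513))             # -> (4, 100, 2, 513)
```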
A major problem with dialectal Arabic speech recognition is the sparsity of speech resources.
In a second pass, a more restricted LM is generated for each audio segment, and unsupervised acoustic model adaptation is applied.