As this problem originates from the conventional scheme used when creating a POS tagging corpus, we propose an improvement to the existing scheme, which makes it friendlier to generative tasks.
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voice characteristic of an unseen speaker.
Recently, the advance in deep learning has brought a considerable improvement in the end-to-end speech recognition field, simplifying the traditional pipeline while producing promising results.
Several recently proposed text-to-speech (TTS) models have achieved human-level quality in generating speech samples, both in single-speaker scenarios and in multi-speaker scenarios with a set of pre-defined speakers.
For training a few-shot keyword spotting (FS-KWS) model, a large labeled dataset containing a massive number of target keywords is known to be essential for generalizing to arbitrary target keywords with only a few enrollment samples.
The experimental results show that fine-tuning an existing pre-trained model with a disentanglement framework is effective and can further improve performance.
Pre-training with self-supervised models, such as Hidden-unit BERT (HuBERT) and wav2vec 2.0, has brought significant improvements in automatic speech recognition (ASR).
The SKA mechanism allows each convolutional layer to adaptively select the kernel size in a data-driven fashion.
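The kernel-selection idea can be illustrated with a simplified 1-D sketch: several branches convolve the input with different kernel sizes, and softmax attention weights derived from the pooled branch outputs softly select among them. This is an illustrative toy, not the paper's actual SKA architecture, and the pooling/attention details here are assumptions.

```python
import numpy as np

def conv1d(x, kernel):
    """'Same'-padded 1-D correlation for illustration (odd kernel lengths)."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(kernel)], kernel) for i in range(len(x))])

def selective_kernel(x, kernels):
    """Run branches with different kernel sizes, then softly combine them
    using data-driven attention weights computed from pooled branch outputs."""
    branches = np.stack([conv1d(x, k) for k in kernels])  # (n_branch, T)
    scores = branches.mean(axis=1)                        # global average pooling
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax over branches
    return (weights[:, None] * branches).sum(axis=0)
```

Because the attention weights depend on the input itself, different inputs can favor different receptive-field sizes, which is the "data-driven kernel selection" behavior described above.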
The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization.
In this paper, we propose self-supervised speaker representation learning strategies, which comprise bootstrap equilibrium speaker representation learning in the front-end and uncertainty-aware probabilistic speaker embedding training in the back-end.
Extending this supervised scheme further, we introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely Oracle Teacher, that leverages both the source inputs and the output labels as the teacher model's input.
Most speech-to-text (S2T) translation studies use English speech as a source, which makes it difficult for non-English speakers to take advantage of the S2T technologies.
Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded in generating human-like speech, there is still room for improvement in naturalness and architectural efficiency.
In this work, we propose StyleTagging-TTS (ST-TTS), a novel expressive TTS model that utilizes a style tag written in natural language.
Photoplethysmogram (PPG) signal-based blood pressure (BP) estimation is a promising candidate for modern BP measurements, as PPG signals can be easily obtained from wearable devices in a non-invasive manner, allowing quick BP measurement.
This paper describes our submission to Task 1 of the Short-duration Speaker Verification (SdSV) challenge 2020.
In this paper, we propose a simple but powerful unsupervised learning method for speaker recognition, namely Contrastive Equilibrium Learning (CEL), which increases the uncertainty on nuisance factors latent in the embeddings by employing the uniformity loss.
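A uniformity loss of the kind referenced above can be sketched with the generic hypersphere-uniformity objective of Wang & Isola (log of the mean pairwise Gaussian potential between normalized embeddings); the exact weighting and combination used in CEL may differ, so treat this as an assumed stand-in.

```python
import numpy as np

def uniformity_loss(embeddings, t=2.0):
    """Uniformity loss on the unit hypersphere:
    log of the mean pairwise Gaussian potential exp(-t * ||z_i - z_j||^2).
    Lower values mean the embeddings spread more uniformly, which raises
    the uncertainty of nuisance factors latent in them."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(z), k=1)  # distinct pairs only
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))
```

Collapsed (identical) embeddings give the maximal value of 0, while well-spread embeddings drive the loss negative, so minimizing it pushes embeddings apart on the sphere.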
To reduce this computational burden, knowledge distillation (KD), which is a popular model compression method, has been used to transfer knowledge from a deep and complex model (teacher) to a shallower and simpler model (student).
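As a minimal illustration of the teacher-to-student transfer described above, here is the standard soft-label KD objective (Hinton-style temperature-softened KL divergence); the papers summarized here build on this idea but use their own formulations.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-label KD loss: KL(teacher || student) on softened outputs.
    The temperature smooths both distributions so the student learns from
    the teacher's full output distribution, not just its argmax."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's soft predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return kl.mean() * temperature ** 2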
Flow-based generative models are composed of invertible transformations between two random variables of the same dimension.
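The change-of-variables principle behind such models can be sketched with a single invertible affine transform: the exact inverse recovers the latent variable, and the log-determinant of the Jacobian corrects the density. This is an illustrative bijection, not any specific flow architecture from the papers listed here.

```python
import numpy as np

class AffineFlow:
    """One invertible transform y = exp(log_scale) * x + shift.
    Flow models chain such bijections between variables of the same
    dimension; the Jacobian log-determinant tracks the density change."""
    def __init__(self, log_scale, shift):
        self.log_scale = np.asarray(log_scale, dtype=float)
        self.shift = np.asarray(shift, dtype=float)

    def forward(self, x):
        y = np.exp(self.log_scale) * x + self.shift
        log_det = np.sum(self.log_scale)  # Jacobian is diagonal
        return y, log_det

    def inverse(self, y):
        return (y - self.shift) * np.exp(-self.log_scale)

def log_prob(flow, y):
    """Density of y under a standard-normal base, via change of variables:
    log p(y) = log N(f^{-1}(y); 0, I) - log|det J|."""
    x = flow.inverse(y)
    base = -0.5 * np.sum(x ** 2 + np.log(2 * np.pi))
    _, log_det = flow.forward(x)
    return base - log_det
```

Because both directions are cheap and exact, such models admit exact likelihood training and fast sampling, which is what makes them attractive for real-time waveform generation.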
In recent years, various flow-based generative models have been proposed to generate high-fidelity waveforms in real-time.
We transfer the knowledge from a concrete Transformer-based text LM to an SLU module which can face a data shortage, based on recent cross-modal distillation methodologies.
Code-mixed grapheme-to-phoneme (G2P) conversion is a crucial issue for modern speech recognition and synthesis tasks, but has seldom been investigated at the sentence level in the literature.
Assessing the similarity of sentences and detecting paraphrases is an essential task in both theory and practice, but building a reliable dataset requires substantial resources.
Modern dialog managers face the challenge of having to fulfill human-level conversational skills as part of common user expectations, including but not limited to discourse with no clear objective.
Analyzing how human beings resolve syntactic ambiguity has long been an issue of interest in the field of linguistics.
Different from the writing systems of many Romance and Germanic languages, some languages or language families show complex conjunct forms in character composition.
However, the detection and evaluation of gender bias in machine translation systems have not yet been thoroughly investigated, as the task is cross-lingual and challenging to define.
This paper proposes a system that identifies the inherent intention of a spoken utterance given its transcript, in some cases using auxiliary acoustic features.
For the readability and disambiguation of written text, appropriate word segmentation is recommended in documentation, and this also holds for digitized texts.
Intention identification is a core issue in dialog management.
This paper proposes a novel feature extraction process for SemEval task 3: Irony detection in English tweets.