However, running a Quantum Neural Network (QNN) on low-qubit quantum devices is difficult, since QNNs are based on the Variational Quantum Circuit (VQC), which requires many qubits.
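A minimal sketch of the VQC building block referred to here, for readers unfamiliar with the construct. PennyLane is assumed purely for illustration; the original work may use a different framework, and the two-qubit circuit below is deliberately small.

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 2  # low-qubit setting; practical QNNs may need many more wires
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def vqc(params, x):
    # Encode classical features into qubit rotation angles.
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    # Trainable variational layer: rotations plus entanglement.
    for i in range(n_qubits):
        qml.RZ(params[i], wires=i)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

params = np.array([0.1, 0.2], requires_grad=True)
x = np.array([0.5, -0.3])
print(vqc(params, x))  # scalar expectation value in [-1, 1]
```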
Therefore, we propose an approach that derives utterance-level speaker embeddings via a Transformer architecture, using a novel loss function, named diffluence loss, to integrate feature information from different Transformer layers.
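The text does not define the diffluence loss itself, so the sketch below only illustrates the general pattern it names: pooling an utterance embedding from every Transformer layer and adding an auxiliary loss over those per-layer embeddings. The cosine-similarity penalty is a hypothetical placeholder, not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerFusionEmbedder(nn.Module):
    def __init__(self, dim=256, n_layers=4, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )

    def forward(self, x):
        # Collect a mean-pooled utterance embedding from every layer.
        per_layer = []
        for layer in self.layers:
            x = layer(x)
            per_layer.append(x.mean(dim=1))        # (batch, dim)
        embedding = torch.stack(per_layer).mean(0)  # fused utterance embedding
        return embedding, per_layer

def layer_similarity_penalty(per_layer):
    # Placeholder auxiliary loss: penalize high cosine similarity between
    # consecutive layers so each contributes complementary information.
    sims = [F.cosine_similarity(a, b, dim=-1).mean()
            for a, b in zip(per_layer[:-1], per_layer[1:])]
    return torch.stack(sims).mean()

x = torch.randn(8, 100, 256)  # (batch, frames, features)
emb, feats = LayerFusionEmbedder()(x)
print(emb.shape, layer_similarity_penalty(feats).item())
```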
In this work, we propose a novel task-adaptive module which is easy to plug into any metric-based few-shot learning framework.
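To make "plug into a metric-based framework" concrete, here is a sketch that inserts an adaptation step into a prototypical-network pipeline. Both the module's internals (channel gating from task-level statistics) and its name are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class TaskAdaptiveModule(nn.Module):
    """Re-weights feature channels from task-level statistics (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, support, query):
        # Task context = mean of all support features in this episode.
        context = support.mean(dim=0, keepdim=True)
        w = self.gate(context)          # channel weights in (0, 1)
        return support * w, query * w   # adapt both sets consistently

def prototypical_logits(support, support_labels, query, n_way):
    # Class prototypes = mean support embedding per class;
    # negative distance to each prototype serves as the logit.
    protos = torch.stack(
        [support[support_labels == c].mean(0) for c in range(n_way)]
    )
    return -torch.cdist(query, protos)

dim, n_way, k_shot = 64, 5, 5
support = torch.randn(n_way * k_shot, dim)
labels = torch.arange(n_way).repeat_interleave(k_shot)
query = torch.randn(15, dim)

adapter = TaskAdaptiveModule(dim)
support, query = adapter(support, query)
print(prototypical_logits(support, labels, query, n_way).shape)  # (15, 5)
```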
In this paper, we aim to evaluate and enhance the robustness of G2P models.
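One common way to probe G2P robustness is to apply small grapheme perturbations and check whether predictions stay stable; the sketch below illustrates that idea. `g2p_model` is a hypothetical stand-in for any grapheme-to-phoneme function, and the adjacent-character swap is just one simple typo-style perturbation.

```python
import random

def perturb(word: str, rng: random.Random) -> str:
    """Swap two adjacent characters: a simple typo-style perturbation."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_rate(g2p_model, words, seed=0):
    """Fraction of words whose phoneme output is unchanged under perturbation."""
    rng = random.Random(seed)
    stable = sum(g2p_model(w) == g2p_model(perturb(w, rng)) for w in words)
    return stable / len(words)
```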
This paper investigates the novel task of talking face video generation solely from speech.
Text-to-speech (TTS) is a crucial task for user interaction, but training a TTS model relies on a sizable set of high-quality original data.
We add an activation regularizer and a virtual interpolation method to improve data generation efficiency.
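A sketch of the two ingredients named above, under assumed forms: an L2 penalty on hidden activations, and a mixup-style "virtual" interpolation of latent features to synthesize extra training points. The paper's exact formulations may differ.

```python
import torch

def activation_regularizer(hidden: torch.Tensor, alpha: float = 1e-4) -> torch.Tensor:
    """L2 penalty on activations, discouraging overly large hidden values."""
    return alpha * hidden.pow(2).mean()

def virtual_interpolation(z: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Interpolate each latent with a shuffled partner to create virtual samples."""
    perm = torch.randperm(z.size(0))
    return lam * z + (1 - lam) * z[perm]

z = torch.randn(16, 128)          # batch of latent features
z_aug = virtual_interpolation(z)  # extra "virtual" training points
reg = activation_regularizer(z)   # added to the main training loss
print(z_aug.shape, reg.item())
```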
We borrow the idea of neural architecture search (NAS) for the text-independent speaker verification task.
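To illustrate the core NAS loop, here is a minimal random-search sketch: sample candidate architectures from a search space and keep the one with the best validation score. The search space and the random-search strategy are illustrative assumptions; speaker verification systems typically search over much richer spaces with more sophisticated strategies.

```python
import random

# Assumed toy search space over a few architecture hyperparameters.
SEARCH_SPACE = {
    "n_layers": [2, 4, 6],
    "channels": [64, 128, 256],
    "kernel":   [3, 5, 7],
}

def sample_architecture(rng: random.Random) -> dict:
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}

def random_search(evaluate, n_trials=20, seed=0):
    """`evaluate(arch) -> float` trains and scores a candidate (stand-in here)."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        arch = sample_architecture(rng)
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score
```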