This paper proposes a talking face generation method named "CP-EB" that takes an audio signal as input and a person image as reference to synthesize a photo-realistic talking video of that person, with head pose controlled by a short video clip and proper eye-blink embedding.
The Retrieval Question Answering (ReQA) task adopts the retrieval-augmented framework, which is composed of a retriever and a generator.
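As a rough, hedged illustration of such a retrieve-then-generate pipeline (the `retriever` and `generator` callables below are placeholders, not a specific library):

```python
def retrieve_then_generate(question, retriever, generator, top_k=5):
    """Sketch of a retrieval-augmented QA pipeline: the retriever selects
    supporting passages, and the generator conditions on them to produce
    the answer. Both components are placeholder callables."""
    passages = retriever(question, top_k=top_k)   # e.g. BM25 or dense retrieval
    context = "\n".join(passages)
    prompt = f"question: {question}\ncontext: {context}"
    return generator(prompt)                      # e.g. a seq2seq or LLM decoder
```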
Due to the powerful capabilities demonstrated by large language models (LLMs), there has been a recent surge in efforts to integrate them with AI agents to enhance their performance.
Most existing sandstorm image enhancement methods are based on traditional theory and prior knowledge, which often restrict their applicability in real-world scenarios.
Generating realistic talking faces is a complex and widely discussed task with numerous applications.
The emergence of the "right to be forgotten" has prompted research on machine unlearning, which grants data owners the right to actively withdraw data that has been used for model training and requires that the contribution of that data be removed from the model.
In the realm of Large Language Models, the balance between instruction data quality and quantity has become a focal point.
Conversational Question Answering (CQA) is a challenging task that aims to generate natural answers to questions in a conversational flow.
Chinese Automatic Speech Recognition (ASR) error correction presents significant challenges due to the Chinese language's unique features, including a large character set and a morpheme-based structure without word boundaries.
To construct or extend entity-centric and event-centric knowledge graphs (KGs and EKGs), an information extraction (IE) annotation toolkit is essential.
Deep neural retrieval models have amply demonstrated their power but estimating the reliability of their predictions remains challenging.
Deep neural networks have achieved remarkable performance in retrieval-based dialogue systems, but they have been shown to be poorly calibrated.
Recent expressive text-to-speech (TTS) models focus on synthesizing emotional speech, but some fine-grained styles, such as intonation, are neglected.
By predicting all target tokens in parallel, non-autoregressive models greatly improve the decoding efficiency of speech recognition compared with traditional autoregressive models.
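To make the efficiency argument concrete, here is a schematic, hedged contrast between the two decoding styles (the `step_fn` and `full_fn` callables are hypothetical stand-ins for model forward passes):

```python
import torch

def autoregressive_decode(step_fn, bos_id, eos_id, max_len=50):
    """One forward pass per generated token: L output tokens cost L passes."""
    tokens = [bos_id]
    for _ in range(max_len):
        next_id = int(step_fn(torch.tensor(tokens)).argmax())
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]

def non_autoregressive_decode(full_fn, encoder_out):
    """All positions are predicted in a single forward pass, then argmax-ed."""
    logits = full_fn(encoder_out)           # (T, vocab) produced in one pass
    return logits.argmax(dim=-1).tolist()
```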
Deep learning methods for classifying EEG signals can accurately identify people's emotions.
Zero-shot information extraction (IE) aims to build IE systems from unannotated text.
The Metaverse extends the physical world into a new dimension, allowing the physical environment and the Metaverse environment to be directly connected and entered.
In this paper, we propose Adapitch, a multi-speaker TTS method that adapts the supervised module using untranscribed data.
In this work, we propose two masking approaches: (1) speech-level masking, which makes the model mask more speech segments than silence segments, and (2) phoneme-level masking, which forces the model to mask all frames of a phoneme rather than partial phoneme pieces.
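A minimal sketch of the two masking strategies, assuming frame-level VAD labels and phoneme alignments are available (the sampling weights and mask ratio are illustrative, not the paper's settings):

```python
import numpy as np

def speech_level_mask(vad_labels, mask_ratio=0.15, rng=None):
    """Pick frames to mask, sampling speech frames more often than silence.
    vad_labels: (T,) array, 1 = speech frame, 0 = silence frame."""
    rng = rng or np.random.default_rng()
    T = len(vad_labels)
    # bias sampling probability towards speech frames (the 3:1 weight is an assumption)
    weights = np.where(vad_labels == 1, 3.0, 1.0)
    weights /= weights.sum()
    idx = rng.choice(T, size=int(mask_ratio * T), replace=False, p=weights)
    mask = np.zeros(T, dtype=bool)
    mask[idx] = True
    return mask

def phoneme_level_mask(phoneme_ids, mask_ratio=0.15, rng=None):
    """Mask whole phonemes: every frame of a selected phoneme is masked together.
    phoneme_ids: (T,) frame-level phoneme alignment (same id = same phoneme)."""
    rng = rng or np.random.default_rng()
    unique_phonemes = np.unique(phoneme_ids)
    n_mask = max(1, int(mask_ratio * len(unique_phonemes)))
    chosen = rng.choice(unique_phonemes, size=n_mask, replace=False)
    return np.isin(phoneme_ids, chosen)
```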
Most previous neural text-to-speech (TTS) methods are based on supervised learning, which means they depend on large training datasets and struggle to achieve comparable performance under low-resource conditions.
Recent advances in pre-trained language models have improved the performance for text classification tasks.
We also find that in the joint CTC-Attention ASR model, the decoder is more sensitive to linguistic information than to acoustic information.
Since the beginning of the COVID-19 pandemic, remote conferencing and online teaching have become important tools.
Buddhism is an influential religion with a long-standing history and profound philosophy.
Nonparallel multi-domain voice conversion methods such as the StarGAN-VC family have been widely applied in many scenarios.
One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic.
In this paper, a novel voice conversion framework, named $\boldsymbol{T}$ext $\boldsymbol{G}$uided $\boldsymbol{A}$utoVC (TGAVC), is proposed to more effectively separate content and timbre from speech, where an expected content embedding produced from the text transcriptions is designed to guide the extraction of voice content.
Existing models mostly establish a bottleneck (BN) layer by pre-training on a large source language and then transfer it to the low-resource target language.
In our experiments, with augmentation-based unsupervised learning, our KWS model achieves better performance than other unsupervised methods such as CPC, APC, and MPC.
In this paper, we propose a novel method that directly extracts coreference and omission relationships from the self-attention weight matrix of the Transformer, rather than from word embeddings, and edits the original text accordingly to generate the complete utterance.
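As a hedged sketch of this idea, the snippet below links each pronoun (or omission slot) to the token that receives the highest self-attention weight; the layer/head choice and the threshold are assumptions, not the paper's exact procedure:

```python
import numpy as np

def resolve_from_attention(attn, tokens, pronoun_positions, threshold=0.2):
    """attn: (T, T) self-attention weights from a chosen layer/head.
    tokens: list of T tokens (dialogue history + current utterance).
    pronoun_positions: indices of pronouns/omission slots in the current utterance.
    Returns a mapping pronoun_index -> antecedent token, taken as the
    highest-weight non-self token above the (assumed) threshold."""
    links = {}
    for p in pronoun_positions:
        weights = attn[p].copy()
        weights[p] = 0.0                 # ignore attention to the pronoun itself
        j = int(np.argmax(weights))
        if weights[j] >= threshold:
            links[p] = tokens[j]
    return links
```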
The visual dialog task attempts to train an agent to answer multi-turn questions about an image, which requires a deep understanding of the interactions between the image and the dialog history.
Voice conversion (VC) aims to convert one speaker's voice into speech that sounds as if it were spoken by another speaker.
This paper investigates a novel task of talking face video generation solely from speech.
End-to-end speech recognition systems usually require huge amounts of labeled data, yet annotating speech data is complicated and expensive.
We evaluated the proposed methods on phoneme classification and speaker recognition tasks.
We propose a novel network structure, called Memory-Self-Attention (MSA) Transducer.
Slot filling and intent detection have become a significant theme in the field of natural language understanding.
To verify its universality over languages, we apply pre-trained models to solve low-resource speech recognition tasks in various spoken languages.
A graph-to-sequence model is proposed, formed by a graph encoder and an attentional decoder.
In this paper, an efficient network, named location-variable convolution, is proposed to model the dependencies of waveforms.
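A rough sketch of the location-variable convolution idea, in which the kernel applied to a stretch of waveform samples is predicted per conditioning frame rather than shared globally; the shapes, names, and per-frame kernel layout below are assumptions for illustration, not the paper's exact operator:

```python
import torch
import torch.nn.functional as F

def location_variable_conv(x, kernels, hop=256):
    """x:       (B, C_in, T) waveform-rate features
    kernels: (B, Fr, C_out, C_in, K) kernels predicted per conditioning frame,
             with Fr = T // hop and K assumed odd.
    Each conditioning frame contributes its own kernel, applied only to the
    hop of samples it governs."""
    B, C_in, T = x.shape
    _, Fr, C_out, _, K = kernels.shape
    outs = []
    for f in range(Fr):
        seg = x[:, :, f * hop:(f + 1) * hop]          # samples governed by frame f
        seg = F.pad(seg, (K // 2, K // 2))            # keep the output length at hop
        out = torch.stack([
            F.conv1d(seg[b:b + 1], kernels[b, f]) for b in range(B)
        ]).squeeze(1)                                 # (B, C_out, hop)
        outs.append(out)
    return torch.cat(outs, dim=-1)                    # (B, C_out, Fr * hop)
```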
The structure of our model is kept concise so that it can be implemented for real-time applications.
MLNET leverages multiple branches to extract diverse contextual speech information and uses an effective attention block to weight the most crucial parts of the context for the final classification.
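A minimal sketch of that multi-branch-plus-attention design (layer sizes, kernel widths, and the pooling scheme are assumptions, not the exact MLNET architecture):

```python
import torch
import torch.nn as nn

class MultiBranchAttention(nn.Module):
    """Branches with different receptive fields extract contextual features,
    and an attention block weights the context frames before classification."""
    def __init__(self, feat_dim=40, hidden=64, n_classes=2):
        super().__init__()
        # different kernel sizes capture different spans of context
        self.branches = nn.ModuleList([
            nn.Conv1d(feat_dim, hidden, kernel_size=k, padding=k // 2)
            for k in (3, 5, 9)
        ])
        self.attn = nn.Linear(hidden * 3, 1)        # scores each context frame
        self.classifier = nn.Linear(hidden * 3, n_classes)

    def forward(self, x):                           # x: (B, T, feat_dim)
        h = torch.cat([b(x.transpose(1, 2)) for b in self.branches], dim=1)
        h = h.transpose(1, 2)                       # (B, T, hidden * 3)
        w = torch.softmax(self.attn(h), dim=1)      # (B, T, 1) frame weights
        pooled = (w * h).sum(dim=1)                 # attention-weighted context
        return self.classifier(pooled)
```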
However, the increased complexity of a model can also introduce a high risk of over-fitting, which is a major challenge in SLU tasks due to the limited available data.
Recent neural speech synthesis systems have gradually focused on the control of prosody to improve the quality of synthesized speech, but they rarely consider the variability of prosody and the correlation between prosody and semantics together.
The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging and named entity recognition.
Most singer identification methods are processed in the frequency domain, which potentially leads to information loss during the spectral transformation.
Targeting both high efficiency and high performance, we propose AlignTTS to predict the mel-spectrum in parallel.
This paper leverages the graph-to-sequence method in neural text-to-speech (GraphTTS), which maps the graph embedding of the input sequence to spectrograms.