no code implementations • 12 Jan 2025 • Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, JianHua Tao, Xuefei Liu, Guanjun Li
Multimodal fake news detection is essential for maintaining the authenticity of Internet multimedia information.
1 code implementation • 11 Jan 2025 • Yuankun Xie, Xiaopeng Wang, Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Songjun Cao, Long Ma, Chenxing Li, Haonan Cheng, Long Ye
Current research in audio deepfake detection is gradually transitioning from binary classification to multi-class tasks, referred to as the audio deepfake source tracing task.
no code implementations • 24 Nov 2024 • Haojie Zhang, Zhihao Liang, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Chenxing Li, JianHua Tao, Yaling Liang
We then propose a suitable solution based on the modality differences among image, audio, and video generation.
no code implementations • 18 Sep 2024 • Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, JianHua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li
In recent years, speech diffusion models have advanced rapidly.
no code implementations • 14 Sep 2024 • Chenxu Xiong, Ruibo Fu, Shuchen Shi, Zhengqi Wen, JianHua Tao, Tao Wang, Chenxing Li, Chunyu Qiang, Yuankun Xie, Xin Qi, Guanjun Li, Zizheng Yang
Additionally, the Sound Event Reference Style Transfer Dataset (SERST) is introduced for the proposed target style audio generation task, enabling dual-prompt audio generation using both text and audio references.
1 code implementation • 20 Aug 2024 • Yuankun Xie, Chenxu Xiong, Xiaopeng Wang, Zhiyong Wang, Yi Lu, Xin Qi, Ruibo Fu, Yukun Liu, Zhengqi Wen, JianHua Tao, Guanjun Li, Long Ye
Currently, Audio Language Models (ALMs) are rapidly advancing due to the developments in large language models and audio neural codecs.
no code implementations • 13 Aug 2024 • Yuankun Xie, Xiaopeng Wang, Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Haonan Cheng, Long Ye
First, we comprehensively investigate various countermeasures (CMs) on ASVspoof5, including data expansion, data augmentation, and self-supervised learning (SSL) features.
no code implementations • 11 Aug 2024 • Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, JianHua Tao
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality.
Automatic Speech Recognition (ASR) +4
no code implementations • 17 Jul 2024 • Cong Cai, Shan Liang, Xuefei Liu, Kang Zhu, Zhengqi Wen, JianHua Tao, Heng Xie, Jizhou Cui, Yiming Ma, Zhenhua Cheng, Hanzhe Xu, Ruibo Fu, Bin Liu, Yongwei Li
We believe that MDPE will become a valuable resource for promoting research in the field of affective computing.
no code implementations • 7 Jul 2024 • Ruibo Fu, Xin Qi, Zhengqi Wen, JianHua Tao, Tao Wang, Chunyu Qiang, Zhiyong Wang, Yi Lu, Xiaopeng Wang, Shuchen Shi, Yukun Liu, Xuefei Liu, Shuai Zhang
The results indicate that the ASRRL method significantly outperforms traditional fine-tuning approaches, achieving higher speaker similarity and better overall speech quality with limited reference speeches.
no code implementations • 2 Jul 2024 • Ruihan Jin, Ruibo Fu, Zhengqi Wen, Shuai Zhang, Yukun Liu, JianHua Tao
To support the research, we introduce a benchmark for fake news detection and manipulation reasoning, referred to as Human-centric and Fact-related Fake News (HFFN).
1 code implementation • 15 Jun 2024 • Ruibo Fu, Shuchen Shi, Hongming Guo, Tao Wang, Chunyu Qiang, Zhengqi Wen, JianHua Tao, Xin Qi, Yi Lu, Xiaopeng Wang, Zhiyong Wang, Yukun Liu, Xuefei Liu, Shuai Zhang, Guanjun Li
Despite advancements in AIGC technologies for text and image generation, Foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation.
no code implementations • 12 Jun 2024 • Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, JianHua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi
To effectively detect LLM-based deepfake audio, we focus on the core of the generation process, the conversion from neural codec to waveform.
no code implementations • 5 Jun 2024 • Yuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonan Cheng, Long Ye, JianHua Tao
For effective OOD detection, we first explore current post-hoc OOD methods and then propose NSD, a novel OOD approach that identifies novel deepfake algorithms by jointly considering the similarity of both feature and logit scores.
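A minimal sketch of the score-fusion idea, assuming a centroid-based feature score and a max-logit confidence score (the fusion weight and exact scoring functions are assumptions, not the paper's formulation):

```python
import numpy as np

def nsd_style_score(feature, logits, class_centroids, alpha=0.5):
    """Fuse feature-space similarity and logit confidence into one OOD score.

    feature:          (d,) penultimate-layer embedding of the test utterance
    logits:           (num_known_classes,) classifier outputs
    class_centroids:  (num_known_classes, d) mean training embedding per class
    Higher score -> more likely to come from a *known* deepfake algorithm.
    """
    # Cosine similarity between the test embedding and each class centroid.
    sims = class_centroids @ feature / (
        np.linalg.norm(class_centroids, axis=1) * np.linalg.norm(feature) + 1e-8
    )
    feature_score = sims.max()   # similarity to the closest known class
    logit_score = logits.max()   # max-logit confidence (a post-hoc baseline)
    return alpha * feature_score + (1 - alpha) * logit_score

# An utterance is routed to "novel algorithm" when the fused score is low:
# is_novel = nsd_style_score(f, z, centroids) < threshold
```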
1 code implementation • 8 May 2024 • Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, JianHua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun
With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods.
no code implementations • 2 Mar 2023 • Jun Xue, Cunhang Fan, Jiangyan Yi, Chenglong Wang, Zhengqi Wen, Dan Zhang, Zhao Lv
To address this problem, we propose using the deepest network to instruct the shallow networks, enhancing their performance.
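A hedged sketch of deepest-to-shallow instruction as self-distillation (the loss weights, temperature, and exit-classifier setup are assumptions):

```python
import torch
import torch.nn.functional as F

def deep_to_shallow_distillation_loss(shallow_logits_list, deep_logits,
                                      labels, T=2.0, beta=0.5):
    """Hypothetical self-distillation objective: the deepest classifier
    'teaches' shallow exit classifiers attached to earlier layers.

    shallow_logits_list: list of (batch, num_classes) logits from shallow exits
    deep_logits:         (batch, num_classes) logits from the deepest network
    """
    loss = F.cross_entropy(deep_logits, labels)        # deepest branch: hard labels
    soft_target = F.softmax(deep_logits.detach() / T, dim=-1)
    for s_logits in shallow_logits_list:
        # Shallow branches match both the labels and the deep branch's output.
        loss = loss + F.cross_entropy(s_logits, labels)
        loss = loss + beta * T * T * F.kl_div(
            F.log_softmax(s_logits / T, dim=-1), soft_target,
            reduction="batchmean",
        )
    return loss
```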
no code implementations • 10 Jan 2023 • Haogeng Liu, Tao Wang, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, JianHua Tao
Text-to-speech (TTS) and voice conversion (VC) are two different tasks that both aim to generate high-quality speech from different input modalities.
no code implementations • 20 Dec 2022 • Tao Wang, Jiangyan Yi, Ruibo Fu, JianHua Tao, Zhengqi Wen, Chu Yuan Zhang
To achieve this task, we propose Emo-CampNet (emotion CampNet), which can provide the option of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech.
no code implementations • 2 Aug 2022 • Jun Xue, Cunhang Fan, Zhao Lv, JianHua Tao, Jiangyan Yi, Chengshi Zheng, Zhengqi Wen, Minmin Yuan, Shegang Shao
Meanwhile, to make full use of the phase and full-band information, we also propose to use real and imaginary spectrogram features as complementary input features and model the disjoint subbands separately.
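A minimal sketch of that input pipeline, assuming PyTorch and illustrative STFT/subband settings (the model's actual frame sizes and subband count are not specified here):

```python
import torch

def real_imag_subband_features(waveform, n_fft=512, hop=160, num_subbands=4):
    """Sketch: complex STFT -> real/imaginary feature maps, then a split
    into disjoint frequency subbands to be modeled separately."""
    spec = torch.stft(
        waveform, n_fft=n_fft, hop_length=hop,
        window=torch.hann_window(n_fft), return_complex=True,
    )                                         # (freq_bins, frames), complex
    real, imag = spec.real, spec.imag         # complementary full-band features
    feats = torch.stack([real, imag], dim=0)  # (2, freq_bins, frames)
    # Disjoint subbands along the frequency axis, one per sub-model.
    return torch.chunk(feats, num_subbands, dim=1)
```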
no code implementations • 5 Mar 2022 • Tao Wang, Ruibo Fu, Jiangyan Yi, JianHua Tao, Zhengqi Wen
We have also verified through experiments that this method can effectively control the noise components in the predicted speech and adjust the SNR of speech.
3 code implementations • 21 Feb 2022 • Tao Wang, Jiangyan Yi, Ruibo Fu, JianHua Tao, Zhengqi Wen
It can solve unnatural prosody in the edited region and synthesize the speech corresponding to the unseen words in the transcript.
no code implementations • 17 Feb 2022 • Jiangyan Yi, Ruibo Fu, JianHua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Xiaohui Zhang, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu
Audio deepfake detection is an emerging topic, which was included in ASVspoof 2021.
no code implementations • 16 Feb 2022 • Tao Wang, Ruibo Fu, Jiangyan Yi, JianHua Tao, Zhengqi Wen
Firstly, we propose a global duration control attention mechanism for the SVS model.
no code implementations • 7 Apr 2021 • Zhengkun Tian, Jiangyan Yi, Ye Bai, JianHua Tao, Shuai Zhang, Zhengqi Wen
Predicting the blank tokens takes a lot of computation and time, yet only the non-blank tokens appear in the final output sequence.
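A toy illustration of why skipping blank-dominated frames saves computation (the threshold and greedy use of posteriors are assumptions, not the paper's exact method):

```python
import numpy as np

def drop_blank_frames(ctc_posteriors, blank_id=0, threshold=0.9):
    """Illustrative frame skipping: frames whose blank posterior exceeds a
    threshold are discarded before any further (expensive) decoding, since
    only non-blank tokens survive into the final output sequence."""
    keep = ctc_posteriors[:, blank_id] < threshold  # (frames,) boolean mask
    return ctc_posteriors[keep]
```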
1 code implementation • 4 Apr 2021 • Zhengkun Tian, Jiangyan Yi, JianHua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen, Xuefei Liu
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT), which improves the performance and accelerates the convergence of the NAR model by learning prior knowledge from a parameter-sharing AR model.
no code implementations • 15 Feb 2021 • Ye Bai, Jiangyan Yi, JianHua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang
Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
no code implementations • 11 Nov 2020 • Cunhang Fan, Bin Liu, JianHua Tao, Jiangyan Yi, Zhengqi Wen, Leichao Song
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
no code implementations • 9 Nov 2020 • Cunhang Fan, Jiangyan Yi, JianHua Tao, Zhengkun Tian, Bin Liu, Zhengqi Wen
Joint training frameworks for speech enhancement and recognition have obtained quite good performance for robust end-to-end automatic speech recognition (ASR).
Automatic Speech Recognition (ASR) +2
no code implementations • 28 Oct 2020 • Zhengkun Tian, Jiangyan Yi, Ye Bai, JianHua Tao, Shuai Zhang, Zhengqi Wen
Inspired by the success of two-pass end-to-end models, we introduce a transformer decoder and the two-stage inference method into the streaming CTC model.
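A rough sketch of the two-pass pattern, with placeholder modules standing in for the paper's streaming CTC model and transformer decoder (signatures and n-best decoding are assumptions):

```python
import torch

@torch.no_grad()
def two_stage_inference(encoder_out, ctc_head, decoder, nbest_decoder,
                        sos=1, eos=2):
    """Two-pass decoding sketch: a streaming CTC first pass proposes
    hypotheses, then an attention decoder rescores them.

    encoder_out:   (frames, d_model) streaming encoder states
    ctc_head:      module mapping encoder states to CTC logits
    decoder:       transformer decoder scoring tokens given encoder_out
    nbest_decoder: callable returning n-best token lists from CTC log-probs
    """
    ctc_logp = ctc_head(encoder_out).log_softmax(dim=-1)
    hypotheses = nbest_decoder(ctc_logp)            # first pass (streaming)

    def rescore(tokens):
        ys = torch.tensor([sos] + tokens)
        logp = decoder(ys.unsqueeze(0), encoder_out.unsqueeze(0)).log_softmax(-1)
        # Sum the decoder's log-probability of each hypothesized token.
        return sum(logp[0, i, t].item() for i, t in enumerate(tokens + [eos]))

    return max(hypotheses, key=rescore)             # second pass (rescoring)
```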
no code implementations • 28 Oct 2020 • Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, JianHua Tao, Zhengqi Wen
In this paper, we propose a decoupled transformer model to use monolingual paired data and unpaired text data to alleviate the problem of code-switching data shortage.
Automatic Speech Recognition (ASR) +1
no code implementations • 16 May 2020 • Zhengkun Tian, Jiangyan Yi, Jian-Hua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen
To address this problem and improve the inference speed, we propose a spike-triggered non-autoregressive transformer model for end-to-end speech recognition, which introduces a CTC module to predict the length of the target sequence and accelerate the convergence.
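A minimal sketch of how CTC spikes can both predict the target length and select decoder inputs (greedy spike selection here is an illustrative simplification of the paper's mechanism):

```python
import torch

def ctc_spike_trigger(ctc_log_probs, encoder_out, blank_id=0):
    """Frames where the CTC module fires a non-blank 'spike' determine the
    target length, and those encoder states can seed the non-autoregressive
    decoder.

    ctc_log_probs: (frames, vocab) CTC output distribution
    encoder_out:   (frames, d_model) encoder states
    """
    best = ctc_log_probs.argmax(dim=-1)           # (frames,) greedy CTC path
    spikes = best != blank_id                     # non-blank spike positions
    predicted_length = int(spikes.sum())          # one output token per spike
    return predicted_length, encoder_out[spikes]  # (predicted_length, d_model)
```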
no code implementations • 11 May 2020 • Ye Bai, Jiangyan Yi, Jian-Hua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang
Without beam search, the one-pass propagation greatly reduces the inference time cost of LASO.
no code implementations • 6 Apr 2020 • Cunhang Fan, Jian-Hua Tao, Bin Liu, Jiangyan Yi, Zhengqi Wen
In this paper, we propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features, which is based on deep clustering (DC).
no code implementations • 17 Mar 2020 • Cunhang Fan, Jian-Hua Tao, Bin Liu, Jiangyan Yi, Zhengqi Wen, Xuefei Liu
Secondly, to pay more attention to the outputs of the pre-separation stage, an attention module is applied to acquire deep attention fusion features, which are extracted by computing the similarity between the mixture and the pre-separated speech.
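One plausible reading of this fusion step as code, assuming dot-product similarity and simple concatenation (both assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn.functional as F

def deep_attention_fusion(mixture_emb, pre_sep_emb):
    """Attend from the mixture representation to the pre-separation outputs
    by their similarity, then concatenate the attended features.

    mixture_emb: (frames, d) embedding of the noisy mixture
    pre_sep_emb: (frames, d) embedding of the pre-separated speech
    """
    sim = mixture_emb @ pre_sep_emb.t() / mixture_emb.shape[-1] ** 0.5
    attn = F.softmax(sim, dim=-1)                  # (frames, frames) weights
    attended = attn @ pre_sep_emb                  # focus on pre-separated cues
    return torch.cat([mixture_emb, attended], dim=-1)  # deep fusion features
```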
no code implementations • 5 Feb 2020 • Cunhang Fan, Bin Liu, Jian-Hua Tao, Jiangyan Yi, Zhengqi Wen
Specifically, we apply the deep clustering network to extract deep embedding features.
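The standard deep-clustering inference step this builds on, as a short sketch (k-means over per-bin embeddings, with the cluster count equal to the number of speakers):

```python
import numpy as np
from sklearn.cluster import KMeans

def dc_masks(embeddings, num_speakers=2):
    """Cluster per-TF-bin embeddings with k-means and turn the cluster
    assignments into binary separation masks.

    embeddings: (time * freq, d) unit-normalized embeddings from the DC network
    returns:    (num_speakers, time * freq) binary masks
    """
    labels = KMeans(n_clusters=num_speakers, n_init=10).fit_predict(embeddings)
    return np.stack(
        [(labels == k).astype(np.float32) for k in range(num_speakers)]
    )
```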
no code implementations • 6 Dec 2019 • Zhengkun Tian, Jiangyan Yi, Ye Bai, Jian-Hua Tao, Shuai Zhang, Zhengqi Wen
Once a fixed-length chunk of the input sequence is processed by the encoder, the decoder begins to predict symbols immediately.
no code implementations • 4 Dec 2019 • Ye Bai, Jiangyan Yi, Jian-Hua Tao, Zhengqi Wen, Zhengkun Tian, Shuai Zhang
To alleviate the above two issues, we propose a unified method called LST (Learn Spelling from Teachers) to integrate knowledge into an AED model from the external text-only data and leverage the whole context in a sentence.
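A hedged sketch of the distillation-style objective such an integration suggests (the interpolation weight, temperature, and use of precomputed teacher distributions are assumptions):

```python
import torch.nn.functional as F

def lst_loss(aed_logits, lm_probs, targets, pad_id=0, lam=0.3, T=1.0):
    """Train the AED model on ground-truth tokens while also matching the
    token distribution of a teacher LM trained on text-only data.

    aed_logits: (batch, length, vocab) decoder outputs
    lm_probs:   (batch, length, vocab) teacher LM distributions (precomputed)
    targets:    (batch, length) ground-truth token ids
    """
    ce = F.cross_entropy(
        aed_logits.transpose(1, 2), targets, ignore_index=pad_id
    )
    kd = F.kl_div(
        F.log_softmax(aed_logits / T, dim=-1), lm_probs, reduction="batchmean"
    )
    return (1 - lam) * ce + lam * kd
```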
no code implementations • 28 Sep 2019 • Zhengkun Tian, Jiangyan Yi, Jian-Hua Tao, Ye Bai, Zhengqi Wen
Furthermore, a path-aware regularization is proposed to assist SA-T to learn alignments and improve the performance.
no code implementations • 23 Jul 2019 • Cunhang Fan, Bin Liu, Jian-Hua Tao, Jiangyan Yi, Zhengqi Wen
Firstly, a DC network is trained to extract deep embedding features, which contain each source's information and are advantageous for discriminating each target speaker.
1 code implementation • 18 Jul 2019 • Yibin Zheng, Xi Wang, Lei He, Shifeng Pan, Frank K. Soong, Zhengqi Wen, Jian-Hua Tao
Experimental results show that our proposed methods, especially the second one (bidirectional decoder regularization), lead to a significant improvement in both robustness and overall naturalness, outperforming the baseline (a revised version of Tacotron 2) with a MOS gap of 0.14 on a challenging test and achieving close to human quality (4.42 vs. 4.49 MOS) on a general test.
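A minimal sketch of what a bidirectional decoder regularizer can look like (the L2 form and the state alignment are assumptions; the paper's exact regularization may differ):

```python
import torch

def bidirectional_decoder_regularizer(fwd_hidden, bwd_hidden):
    """A second decoder generates the sequence in reverse order, and its
    (re-reversed) hidden states are pulled toward the forward decoder's
    states so both views of the target agree.

    fwd_hidden: (length, d) forward (left-to-right) decoder states
    bwd_hidden: (length, d) backward (right-to-left) decoder states
    """
    return torch.mean((fwd_hidden - torch.flip(bwd_hidden, dims=[0])) ** 2)
```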
no code implementations • 13 Jul 2019 • Ye Bai, Jiangyan Yi, Jian-Hua Tao, Zhengkun Tian, Zhengqi Wen
Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial.
no code implementations • 20 Feb 2018 • Jiangyan Yi, Jian-Hua Tao, Zhengqi Wen, Bin Liu
The close-talking model is called the teacher model.
no code implementations • 28 Mar 2016 • Linlin Chao, Jian-Hua Tao, Minghao Yang, Ya Li, Zhengqi Wen
The other locates and re-weights the perceptual attentions over the whole audio-visual stream for better recognition.