Search Results for author: Ruibo Fu

Found 30 papers, 6 papers with code

Towards Diverse and Efficient Audio Captioning via Diffusion Models

no code implementations • 14 Sep 2024 • Manjie Xu, Chenxing Li, Xinyi Tu, Yong Ren, Ruibo Fu, Wei Liang, Dong Yu

We introduce Diffusion-based Audio Captioning (DAC), a non-autoregressive diffusion model tailored for diverse and efficient audio captioning.

Audio captioning Diversity +1

Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation

no code implementations • 14 Sep 2024 • Chenxu Xiong, Ruibo Fu, Shuchen Shi, Zhengqi Wen, JianHua Tao, Tao Wang, Chenxing Li, Chunyu Qiang, Yuankun Xie, Xin Qi, Guanjun Li, Zizheng Yang

Additionally, the Sound Event Reference Style Transfer Dataset (SERST) is introduced for the proposed target style audio generation task, enabling dual-prompt audio generation using both text and audio references.

Audio Generation Style Transfer

Temporal Variability and Multi-Viewed Self-Supervised Representations to Tackle the ASVspoof5 Deepfake Challenge

no code implementations • 13 Aug 2024 • Yuankun Xie, Xiaopeng Wang, Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Haonan Cheng, Long Ye

First, we comprehensively investigate various countermeasures (CMs) on ASVspoof5, including data expansion, data augmentation, and self-supervised learning (SSL) features.

Audio Deepfake Detection Data Augmentation +3

VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

no code implementations • 11 Aug 2024 • Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, JianHua Tao

For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4
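The excerpt above only states the goal of a frame-level (fine-grained) cross-modal representation; the VQ-CTAP method itself is not detailed here. As a hedged illustration of that goal, not of the paper's actual implementation, a frame-level contrastive objective between paired text-token and speech-frame embeddings can be sketched as a symmetric InfoNCE loss. All function names, tensor shapes, and the choice of InfoNCE below are assumptions.

```python
# Illustrative sketch only: a frame-level contrastive loss between time-aligned
# text-token embeddings and speech-frame embeddings. Shapes, names, and the use
# of a symmetric InfoNCE objective are assumptions, not the VQ-CTAP implementation.
import torch
import torch.nn.functional as F

def frame_level_contrastive_loss(text_emb, speech_emb, temperature=0.07):
    """text_emb, speech_emb: (batch, frames, dim), assumed to be time-aligned pairs."""
    b, t, d = text_emb.shape
    text = F.normalize(text_emb.reshape(b * t, d), dim=-1)
    speech = F.normalize(speech_emb.reshape(b * t, d), dim=-1)
    logits = text @ speech.t() / temperature              # every text frame vs. every speech frame
    targets = torch.arange(b * t, device=logits.device)   # the matching frame is the positive
    # symmetric loss: text -> speech and speech -> text
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```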

ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation

no code implementations • 7 Jul 2024 • Ruibo Fu, Xin Qi, Zhengqi Wen, JianHua Tao, Tao Wang, Chunyu Qiang, Zhiyong Wang, Yi Lu, Xiaopeng Wang, Shuchen Shi, Yukun Liu, Xuefei Liu, Shuai Zhang

The results indicate that the ASRRL method significantly outperforms traditional fine-tuning approaches, achieving higher speaker similarity and better overall speech quality with limited reference speeches.

Sentence Text to Speech

Fake News Detection and Manipulation Reasoning via Large Vision-Language Models

no code implementations • 2 Jul 2024 • Ruihan Jin, Ruibo Fu, Zhengqi Wen, Shuai Zhang, Yukun Liu, JianHua Tao

To support the research, we introduce a benchmark for fake news detection and manipulation reasoning, referred to as Human-centric and Fact-related Fake News (HFFN).

Binary Classification Fake News Detection +1

A Multi-Speaker Multi-Lingual Voice Cloning System Based on VITS2 for the LIMMITS 2024 Challenge

no code implementations • 22 Jun 2024 • Xiaopeng Wang, Yi Lu, Xin Qi, Zhiyong Wang, Yuankun Xie, Shuchen Shi, Ruibo Fu

The objective of the challenge is to establish a multi-speaker, multi-lingual Indic Text-to-Speech system with voice cloning capabilities, covering seven Indian languages with both male and female speakers.

Speech Synthesis Text to Speech +1

MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation

1 code implementation • 15 Jun 2024 • Ruibo Fu, Shuchen Shi, Hongming Guo, Tao Wang, Chunyu Qiang, Zhengqi Wen, JianHua Tao, Xin Qi, Yi Lu, Xiaopeng Wang, Zhiyong Wang, Yukun Liu, Xuefei Liu, Shuai Zhang, Guanjun Li

Despite advancements in AIGC technologies for text and image generation, Foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation.

AudioCaps Image Generation

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio

no code implementations • 12 Jun 2024 • Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, JianHua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi

To effectively detect LLM-based deepfake audio, we focus on the core of the generation process, the conversion from neural codec to waveform.

Audio Deepfake Detection Audio Generation +4

Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy

no code implementations • 5 Jun 2024 • Yuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonan Cheng, Long Ye, JianHua Tao

For effective OOD detection, we first explore current post-hoc OOD methods and then propose NSD, a novel OOD approach that identifies novel deepfake algorithms by jointly considering the similarity of both feature and logit scores.

Audio Deepfake Detection DeepFake Detection +1
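The excerpt only says that NSD considers the similarity of both feature and logit scores; the exact formulation is not given in this listing. A minimal sketch in that spirit, combining cosine similarity to the nearest known-class centroid with the max logit, is shown below. The specific fusion, the centroid construction, and all names are assumptions, not the paper's method.

```python
# Hedged sketch of a post-hoc OOD score that combines feature-space similarity with
# logit scores, in the spirit of the description above. The additive fusion and the
# use of class centroids are assumptions.
import numpy as np

def ood_score(features, logits, class_centroids):
    """features: (n, d) penultimate-layer embeddings; logits: (n, c); class_centroids: (c, d).
    Higher scores suggest a known (in-distribution) spoofing algorithm; lower scores suggest a novel one."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = class_centroids / np.linalg.norm(class_centroids, axis=1, keepdims=True)
    feat_sim = (f @ c.T).max(axis=1)      # similarity to the closest known-class centroid
    logit_score = logits.max(axis=1)      # max-logit confidence
    return feat_sim + logit_score         # simple additive fusion (assumed weighting)
```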

Learning Speech Representation From Contrastive Token-Acoustic Pretraining

no code implementations • 1 Sep 2023 • Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang

However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR tasks.

Audio Classification Automatic Speech Recognition +6

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

no code implementations • 28 Jul 2023 • Chunyu Qiang, Hao Li, Hao Ni, He Qu, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang

However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging problem caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion problems of existing semantic encoding methods.

Language Modelling Speech Synthesis +1

Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection

no code implementations • 9 Jun 2023 • Chenglong Wang, Jiangyan Yi, Xiaohui Zhang, JianHua Tao, Le Xu, Ruibo Fu

Self-supervised speech models are a rapidly developing research topic in fake audio detection.

Adaptive Fake Audio Detection with Low-Rank Model Squeezing

no code implementations • 8 Jun 2023 • Xiaohui Zhang, Jiangyan Yi, JianHua Tao, Chenlong Wang, Le Xu, Ruibo Fu

During the inference stage, these adaptation matrices are combined with the existing model to generate the final prediction output.
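As a rough illustration of the merging step described above (combining low-rank adaptation matrices with the existing model at inference), the standard low-rank update can be folded directly into a frozen weight. The matrix names, shapes, and scaling below are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of merging a low-rank adaptation into a frozen weight at inference.
# Shapes, names, and the scaling factor are assumptions.
import numpy as np

def merge_low_rank(W, A, B, scale=1.0):
    """W: (out, in) frozen weight; A: (r, in) and B: (out, r) low-rank adaptation factors."""
    return W + scale * (B @ A)   # W' = W + scale * B A, used directly for the final prediction

# usage: y = merge_low_rank(W, A, B) @ x, instead of keeping A and B as separate layers
```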

UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

no code implementations • 10 Jan 2023 • Haogeng Liu, Tao Wang, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, JianHua Tao

Text-to-speech (TTS) and voice conversion (VC) are two different tasks that both aim to generate high-quality speech from different input modalities.

Quantization Text to Speech +1

Emotion Selectable End-to-End Text-based Speech Editing

no code implementations • 20 Dec 2022 • Tao Wang, Jiangyan Yi, Ruibo Fu, JianHua Tao, Zhengqi Wen, Chu Yuan Zhang

To achieve this task, we propose Emo-CampNet (emotion CampNet), which allows emotional attributes to be selected for the generated speech in text-based speech editing and can edit unseen speakers' speech in a one-shot manner.

Data Augmentation

Fully Automated End-to-End Fake Audio Detection

no code implementations • 20 Aug 2022 • Chenglong Wang, Jiangyan Yi, JianHua Tao, Haiyang Sun, Xun Chen, Zhengkun Tian, Haoxin Ma, Cunhang Fan, Ruibo Fu

Existing fake audio detection systems often rely on expert experience to design acoustic features or to manually tune the hyperparameters of the network structure.

NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation

no code implementations • 5 Mar 2022 • Tao Wang, Ruibo Fu, Jiangyan Yi, JianHua Tao, Zhengqi Wen

We have also verified experimentally that this method can effectively control the noise components in the predicted speech and adjust its signal-to-noise ratio (SNR).

CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing

3 code implementations • 21 Feb 2022 • Tao Wang, Jiangyan Yi, Ruibo Fu, JianHua Tao, Zhengqi Wen

It can solve unnatural prosody in the edited region and synthesize the speech corresponding to the unseen words in the transcript.

Few-Shot Learning Sentence
