no code implementations • 22 Feb 2025 • Yuguang Yang, Tongfei Chen, Haoyu Huang, Linlin Yang, Chunyu Xie, Dawei Leng, Xianbin Cao, Baochang Zhang
Zero-shot medical detection can further improve detection performance without relying on annotated medical images, even on top of a fine-tuned model, showing great clinical value.
no code implementations • 5 Feb 2025 • Jixun Yao, Yuguang Yang, Yu Pan, Yuan Feng, Ziqian Ning, Jianhao Ye, Hongbin Zhou, Lei Xie
In this study, we propose a fine-grained preference optimization approach (FPO) to enhance the robustness of TTS systems.
no code implementations • 6 Dec 2024 • Jixun Yao, Yuguang Yang, Yu Pan, Ziqian Ning, Jiaohao Ye, Hongbin Zhou, Lei Xie
Zero-shot voice conversion (VC) aims to transfer the timbre from the source speaker to an arbitrary unseen speaker while preserving the original linguistic content.
no code implementations • 4 Nov 2024 • Yu Pan, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao
Zero-shot voice conversion (VC) aims to transform the timbre of a source speaker into any previously unseen target speaker, while preserving the original linguistic content.
no code implementations • 18 Oct 2024 • Bin Lin, Yanzhen Yu, Jianhao Ye, Ruitao Lv, Yuguang Yang, Ruoye Xie, Pan Yu, Hongbin Zhou
We discovered that these issues stem from limitations in motion representation and the lack of fine-grained control over facial expressions.
no code implementations • 2 Oct 2024 • Yuguang Yang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye, Hongbin Zhou, Lei Xie, Lei Ma, Jianjun Zhao
Expressive zero-shot voice conversion (VC) is a critical and challenging task that aims to transfer the source speaker's timbre to an arbitrary unseen speaker while preserving the original content and expressive qualities.
no code implementations • 18 Sep 2024 • Sijing Chen, Yuan Feng, Laipeng He, Tianwei He, Wendi He, Yanni Hu, Bin Lin, Yiting Lin, Yu Pan, Pengfei Tan, Chengwei Tian, Chen Wang, Zhicheng Wang, Ruoye Xie, Jixun Yao, Quanlei Yan, Yuguang Yang, Jianhao Ye, JingJing Yin, Yanzhen Yu, Huimin Zhang, Xiang Zhang, Guangcheng Zhao, Hongbin Zhou, Pengpeng Zou
In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly including Takin TTS, Takin VC, and Takin Morphing, specifically designed for audiobook production.
no code implementations • 16 Jul 2024 • Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, Yuguang Yang, Yu Pan, Lei Xie
Meanwhile, we propose a straightforward anonymization strategy that employs an empty embedding with zero values to simulate the speaker identity concealment process, eliminating the need for conversion to a pseudo-speaker identity and thereby reducing the complexity of the speaker anonymization process.
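The zero-valued embedding idea lends itself to a brief illustration. The sketch below is a generic, hypothetical conditioning module (module name and dimensions are assumptions, not the paper's code) showing how an all-zero speaker embedding can stand in for a real one at anonymization time.

```python
import torch
import torch.nn as nn

class ToySpeakerConditioner(nn.Module):
    """Hypothetical conditioning module: projects a speaker embedding and
    adds it to content features before decoding (illustrative only)."""
    def __init__(self, content_dim=256, spk_dim=192):
        super().__init__()
        # bias=False so a zero embedding contributes exactly nothing
        self.proj = nn.Linear(spk_dim, content_dim, bias=False)

    def forward(self, content, spk_emb=None, anonymize=False):
        # content: (batch, frames, content_dim); spk_emb: (batch, spk_dim)
        if anonymize or spk_emb is None:
            # "Empty" embedding with zero values: no pseudo-speaker lookup needed.
            spk_emb = torch.zeros(content.size(0), self.proj.in_features,
                                  device=content.device)
        return content + self.proj(spk_emb).unsqueeze(1)

# Usage sketch
cond = ToySpeakerConditioner()
content = torch.randn(2, 100, 256)
out_normal = cond(content, torch.randn(2, 192))   # conditioned on a real speaker
out_anon = cond(content, anonymize=True)          # speaker identity concealed
```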
1 code implementation • 29 May 2024 • Yuguang Yang, Runtang Guo, Sheng Wu, Yimi Wang, Linlin Yang, Bo Fan, Jilong Zhong, Juan Zhang, Baochang Zhang
Interpreting complex deep networks, notably pre-trained vision-language models (VLMs), is a formidable challenge.
no code implementations • 3 May 2024 • Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao
The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER).
no code implementations • 3 Apr 2024 • Yu Pan, Xiang Zhang, Yuguang Yang, Jixun Yao, Yanni Hu, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao
In this paper, we propose PSCodec, a series of neural speech codecs based on prompt encoders, comprising PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN, which are capable of delivering high-performance speech reconstruction with low bandwidths.
no code implementations • 28 Sep 2023 • Xiang Lyu, Yuhang Cao, Qing Wang, JingJing Yin, Yuguang Yang, Pengpeng Zou, Yanni Hu, Heng Lu
Speaker-attributed automatic speech recognition (SA-ASR) improves the accuracy and applicability of multi-speaker ASR systems in real-world scenarios by assigning speaker labels to transcribed texts.
no code implementations • 17 Sep 2023 • Jixun Yao, Yuguang Yang, Yi Lei, Ziqian Ning, Yanni Hu, Yu Pan, JingJing Yin, Hongbin Zhou, Heng Lu, Lei Xie
In this study, we propose PromptVC, a novel style voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts.
no code implementations • 8 Aug 2023 • Yu Pan, Yuguang Yang, Yuheng Huang, Jixun Yao, JingJing Yin, Yanni Hu, Heng Lu, Lei Ma, Jianjun Zhao
Despite notable progress, speech emotion recognition (SER) remains challenging due to the intricate and ambiguous nature of speech emotion, particularly in in-the-wild conditions.
no code implementations • 13 Jun 2023 • Yu Pan, Yanni Hu, Yuguang Yang, Wen Fei, Jixun Yao, Heng Lu, Lei Ma, Jianjun Zhao
Contrastive cross-modality pretraining has recently exhibited impressive success in diverse fields, whereas there is limited research on their merits in speech emotion recognition (SER).
1 code implementation • 11 Jun 2023 • Yuguang Yang, Yiming Wang, Shupeng Geng, Runqi Wang, Yimi Wang, Sheng Wu, Baochang Zhang
The emergence of cross-modal foundation models has introduced numerous approaches grounded in text-image retrieval.
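For readers unfamiliar with the retrieval framing, here is a minimal, generic sketch of text-image retrieval with a dual encoder: rank images against a text query by cosine similarity of their embeddings. This is an illustration of the paradigm, not this paper's method.

```python
import torch
import torch.nn.functional as F

def retrieve(text_emb, image_embs, top_k=5):
    """Rank images for one text query by cosine similarity of embeddings.

    text_emb:   (d,) output of a text encoder
    image_embs: (n, d) outputs of an image encoder
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    scores = image_embs @ text_emb          # (n,) cosine similarities
    return torch.topk(scores, k=top_k)

# Usage with random stand-in embeddings
scores, indices = retrieve(torch.randn(512), torch.randn(1000, 512))
```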
1 code implementation • journal 2023 • Yuguang Yang, Shupeng Geng, Baochang Zhang, Juan Zhang, Zheng Wang, Yong Zhang & David Doermann
However, a long prediction horizon exposes the non-stationarity of time-series data, which degrades the performance of existing approaches.
no code implementations • 27 May 2023 • Yuguang Yang, Runtang Guo, Sheng Wu, Yimi Wang, Juan Zhang, Xuan Gong, Baochang Zhang
Although the Class Activation Map (CAM) is widely used to interpret deep model predictions by highlighting object location, it fails to provide insight into the salient features used by the model to make decisions.
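For reference, the classic CAM the sentence refers to weights the last convolutional feature maps by the classifier weights of the class of interest. The sketch below implements that standard recipe (generic CAM, not this paper's proposed interpretation method).

```python
import torch

def class_activation_map(feature_maps, fc_weight, class_idx):
    """Classic CAM for a CNN ending in global average pooling + linear layer.

    feature_maps: (C, H, W) activations of the last conv layer
    fc_weight:    (num_classes, C) weight of the final linear layer
    class_idx:    class to explain
    """
    weights = fc_weight[class_idx]                        # (C,)
    cam = torch.einsum('c,chw->hw', weights, feature_maps)
    cam = torch.relu(cam)                                 # keep positive evidence
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam                                            # (H, W), in [0, 1]

# Usage with random stand-in tensors
cam = class_activation_map(torch.randn(2048, 7, 7), torch.randn(1000, 2048), 281)
```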
1 code implementation • 15 Mar 2023 • Yuguang Yang, Yu Pan, JingJing Yin, Jiangyu Han, Lei Ma, Heng Lu
SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR).
1 code implementation • 5 Dec 2022 • Yuguang Yang, Yu Pan, JingJing Yin, Heng Lu
This paper proposes a Learnable Multiplicative absolute position Embedding based Conformer (LMEC).
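As a rough illustration of what a multiplicative (rather than additive) learnable absolute position embedding can look like, the sketch below scales each time step's features by a learned per-position vector; the actual LMEC formulation may differ.

```python
import torch
import torch.nn as nn

class MultiplicativeAbsPosEmbedding(nn.Module):
    """Learnable per-position scales applied multiplicatively, in contrast to
    the usual additive absolute position embedding (illustrative sketch only;
    the actual LMEC design may differ)."""
    def __init__(self, max_len=5000, d_model=256):
        super().__init__()
        # Initialize at 1 so training starts close to "no positional effect".
        self.scale = nn.Parameter(torch.ones(max_len, d_model))

    def forward(self, x):
        # x: (batch, time, d_model)
        return x * self.scale[: x.size(1)].unsqueeze(0)

# Usage sketch
pos = MultiplicativeAbsPosEmbedding()
y = pos(torch.randn(4, 120, 256))   # same shape, position-dependent scaling
```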
1 code implementation • 23 Feb 2022 • Hua Shen, Yuguang Yang, Guoli Sun, Ryan Langman, Eunjung Han, Jasha Droppo, Andreas Stolcke
This is observed especially with underrepresented demographic groups sharing similar voice characteristics.
no code implementations • 7 Feb 2022 • Metehan Cekic, Ruirui Li, Zeya Chen, Yuguang Yang, Andreas Stolcke, Upamanyu Madhow
Speaker recognition, the task of recognizing speaker identities from voice alone, enables important downstream applications such as personalization and authentication.
no code implementations • 2 Feb 2022 • Aparna Khare, Eunjung Han, Yuguang Yang, Andreas Stolcke
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
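One straightforward way to combine the two input streams is frame-level concatenation followed by a projection before the diarization encoder; the sketch below illustrates that generic fusion (an assumption, not necessarily the fusion used in this work).

```python
import torch
import torch.nn as nn

class FeatureFusionFrontend(nn.Module):
    """Concatenate acoustic frames with frame-aligned ASR-derived features
    and project them before the diarization encoder (illustrative only)."""
    def __init__(self, acoustic_dim=80, asr_dim=256, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(acoustic_dim + asr_dim, out_dim)

    def forward(self, acoustic, asr_feats):
        # both inputs: (batch, frames, dim), assumed frame-aligned
        return self.proj(torch.cat([acoustic, asr_feats], dim=-1))

# Usage sketch
fused = FeatureFusionFrontend()(torch.randn(2, 300, 80), torch.randn(2, 300, 256))
```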
no code implementations • 18 Sep 2021 • Yuguang Yang, Yu Pan, Xin Dong, Minqiang Xu
Second, we design a novel model inference scheme based on RepVGG that efficiently improves QbE search quality.
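For context, the efficiency of RepVGG at inference time comes from structural re-parameterization: the training-time 3x3, 1x1, and identity branches (each followed by batch norm) are fused into a single 3x3 convolution. The sketch below shows the standard fusion for the single-group case and illustrates the general RepVGG idea rather than this paper's specific inference scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(kernel, bn):
    """Fold a BatchNorm2d into the preceding convolution's kernel and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = (bn.weight / std).reshape(-1, 1, 1, 1)
    return kernel * scale, bn.bias - bn.running_mean * bn.weight / std

def reparameterize(conv3, bn3, conv1, bn1, bn_id, channels):
    """Merge the 3x3, 1x1, and identity branches into one 3x3 conv (groups=1)."""
    k3, b3 = fuse_conv_bn(conv3.weight, bn3)
    k1, b1 = fuse_conv_bn(F.pad(conv1.weight, [1, 1, 1, 1]), bn1)  # 1x1 -> centred 3x3
    kid = torch.zeros(channels, channels, 3, 3)
    for i in range(channels):
        kid[i, i, 1, 1] = 1.0                        # identity branch as a 3x3 kernel
    kid, bid = fuse_conv_bn(kid, bn_id)
    return k3 + k1 + kid, b3 + b1 + bid

# Usage: build bias-free branches and fuse them into a single conv for inference
c = 8
conv3, bn3 = nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c)
conv1, bn1 = nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c)
bn_id = nn.BatchNorm2d(c)
kernel, bias = reparameterize(conv3, bn3, conv1, bn1, bn_id, c)
fused = nn.Conv2d(c, c, 3, padding=1)
with torch.no_grad():
    fused.weight.copy_(kernel)
    fused.bias.copy_(bias)
```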
no code implementations • 6 Sep 2021 • Zhenning Tan, Yuguang Yang, Eunjung Han, Andreas Stolcke
Second, a scoring function is applied between a runtime utterance and each speaker profile.
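A common choice for such a scoring function is cosine similarity between the runtime utterance embedding and each enrolled profile; the sketch below assumes that choice, which is not necessarily the scoring used in this paper.

```python
import torch
import torch.nn.functional as F

def score_against_profiles(utt_emb, profiles):
    """Cosine similarity between one runtime utterance embedding and
    each enrolled speaker profile.

    utt_emb:  (d,) speaker embedding of the runtime utterance
    profiles: (num_speakers, d) one profile embedding per enrolled speaker
    """
    scores = F.cosine_similarity(utt_emb.unsqueeze(0), profiles, dim=-1)
    return scores, int(scores.argmax())   # all scores and the best-matching speaker

# Usage with random stand-in embeddings
scores, best = score_against_profiles(torch.randn(256), torch.randn(4, 256))
```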
no code implementations • 25 Nov 2019 • Yuguang Yang
SDQL exploits the linear stage structure by approximating the Q function via a collection of deep Q sub-networks stacked along an axis marking the stage-wise progress of the whole task.
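Taken literally, a stage-indexed Q function can be realized as one Q sub-network per stage with dispatch on the current stage index; the sketch below is that literal reading, not necessarily the exact SDQL architecture.

```python
import torch
import torch.nn as nn

class StagewiseQ(nn.Module):
    """One Q sub-network per stage; the stage index selects which sub-network
    scores the action values for the current state (illustrative only)."""
    def __init__(self, num_stages, state_dim, num_actions, hidden=128):
        super().__init__()
        self.subnets = nn.ModuleList(
            nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, num_actions))
            for _ in range(num_stages)
        )

    def forward(self, state, stage):
        # state: (batch, state_dim); stage: int index of task progress
        return self.subnets[stage](state)       # (batch, num_actions)

# Usage sketch: greedy action at stage 2
q = StagewiseQ(num_stages=4, state_dim=16, num_actions=6)
greedy_action = q(torch.randn(1, 16), stage=2).argmax(dim=-1)
```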
no code implementations • 26 Jun 2019 • Yuguang Yang, Michael A. Bevan, Bo Li
Equipping active colloidal robots with intelligence such that they can efficiently navigate in unknown complex environments could dramatically impact their use in emerging applications like precision surgery and targeted drug delivery.