Search Results for author: Yuguang Yang

Found 27 papers, 6 papers with code

Prompt as Knowledge Bank: Boost Vision-language model via Structural Representation for zero-shot medical detection

no code implementations • 22 Feb 2025 • Yuguang Yang, Tongfei Chen, Haoyu Huang, Linlin Yang, Chunyu Xie, Dawei Leng, Xianbin Cao, Baochang Zhang

Zero-shot medical detection can further improve detection performance without relying on annotated medical images even upon the fine-tuned model, showing great clinical value.

Language Modeling, Language Modelling

Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech

no code implementations • 5 Feb 2025 • Jixun Yao, Yuguang Yang, Yu Pan, Yuan Feng, Ziqian Ning, Jianhao Ye, Hongbin Zhou, Lei Xie

In this study, we propose a fine-grained preference optimization approach (FPO) to enhance the robustness of TTS systems.

Language Modeling, Language Modelling, +1

StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching

no code implementations • 6 Dec 2024 • Jixun Yao, Yuguang Yang, Yu Pan, Ziqian Ning, Jiaohao Ye, Hongbin Zhou, Lei Xie

Zero-shot voice conversion (VC) aims to transfer the timbre from the source speaker to an arbitrary unseen speaker while preserving the original linguistic content.

Voice Conversion

CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching

no code implementations • 4 Nov 2024 • Yu Pan, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao

Zero-shot voice conversion (VC) aims to transform the timbre of a source speaker into any previously unseen target speaker, while preserving the original linguistic content.

Speaker Verification, Voice Conversion

Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization

no code implementations • 18 Oct 2024 • Bin Lin, Yanzhen Yu, Jianhao Ye, Ruitao Lv, Yuguang Yang, Ruoye Xie, Pan Yu, Hongbin Zhou

We discovered that these issues stem from limitations in motion representation and the lack of fine-grained control over facial expressions.

Portrait Animation

Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling

no code implementations • 2 Oct 2024 • Yuguang Yang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye, Hongbin Zhou, Lei Xie, Lei Ma, Jianjun Zhao

Expressive zero-shot voice conversion (VC) is a critical and challenging task that aims to transform the source timbre into an arbitrary unseen speaker while preserving the original content and expressive qualities.

Voice Conversion

MUSA: Multi-lingual Speaker Anonymization via Serial Disentanglement

no code implementations • 16 Jul 2024 • Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, Yuguang Yang, Yu Pan, Lei Xie

Meanwhile, we propose a straightforward anonymization strategy that employs an empty embedding with zero values to simulate the speaker identity concealment process, eliminating the need for conversion to a pseudo-speaker identity and thereby reducing the complexity of the speaker anonymization process.

Disentanglement

PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders

no code implementations • 3 Apr 2024 • Yu Pan, Xiang Zhang, Yuguang Yang, Jixun Yao, Yanni Hu, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao

In this paper, we propose PSCodec, a series of neural speech codecs based on prompt encoders, comprising PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN, which are capable of delivering high-performance speech reconstruction with low bandwidths.

Representation Learning, Speaker Verification, +5

PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System

no code implementations • 28 Sep 2023 • Xiang Lyu, Yuhang Cao, Qing Wang, JingJing Yin, Yuguang Yang, Pengpeng Zou, Yanni Hu, Heng Lu

Speaker-attributed automatic speech recognition (SA-ASR) improves the accuracy and applicability of multi-speaker ASR systems in real-world scenarios by assigning speaker labels to transcribed texts.

Action Detection, Activity Detection, +3

PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

no code implementations • 17 Sep 2023 • Jixun Yao, Yuguang Yang, Yi Lei, Ziqian Ning, Yanni Hu, Yu Pan, JingJing Yin, Hongbin Zhou, Heng Lu, Lei Xie

In this study, we propose PromptVC, a novel style voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts.

Voice Conversion

MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition

no code implementations • 8 Aug 2023 • Yu Pan, Yuguang Yang, Yuheng Huang, Jixun Yao, JingJing Yin, Yanni Hu, Heng Lu, Lei Ma, Jianjun Zhao

Despite notable progress, speech emotion recognition (SER) remains challenging due to the intricate and ambiguous nature of speech emotion, particularly in the wild.

Attribute, Cross-corpus, +2

GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio Pretraining for Accurate Speech Emotion Recognition

no code implementations • 13 Jun 2023 • Yu Pan, Yanni Hu, Yuguang Yang, Wen Fei, Jixun Yao, Heng Lu, Lei Ma, Jianjun Zhao

Contrastive cross-modality pretraining has recently exhibited impressive success in diverse fields, whereas there is limited research on their merits in speech emotion recognition (SER).

Attribute, Contrastive Learning, +3

Decom-CAM: Tell Me What You See, In Details! Feature-Level Interpretation via Decomposition Class Activation Map

no code implementations • 27 May 2023 • Yuguang Yang, Runtang Guo, Sheng Wu, Yimi Wang, Juan Zhang, Xuan Gong, Baochang Zhang

Although the Class Activation Map (CAM) is widely used to interpret deep model predictions by highlighting object location, it fails to provide insight into the salient features used by the model to make decisions.

Decision Making

Self-supervised Speaker Recognition Training Using Human-Machine Dialogues

no code implementations • 7 Feb 2022 • Metehan Cekic, Ruirui Li, Zeya Chen, Yuguang Yang, Andreas Stolcke, Upamanyu Madhow

Speaker recognition, recognizing speaker identities based on voice alone, enables important downstream applications, such as personalization and authentication.

Contrastive Learning, Speaker Recognition

ASR-Aware End-to-end Neural Diarization

no code implementations • 2 Feb 2022 • Aparna Khare, Eunjung Han, Yuguang Yang, Andreas Stolcke

We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.

Automatic Speech Recognition, Automatic Speech Recognition (ASR), +4

Fast query-by-example speech search using separable model

no code implementations • 18 Sep 2021 • Yuguang Yang, Yu Pan, Xin Dong, Minqiang Xu

Second, we design a novel model inference scheme based on RepVGG which can efficiently improve the QbE search quality.

Word Embeddings

A Deep Reinforcement Learning Architecture for Multi-stage Optimal Control

no code implementations • 25 Nov 2019 • Yuguang Yang

SDQL exploits the linear stage structure by approximating the Q function via a collection of deep Q sub-networks stacking along an axis marking the stage-wise progress of the whole task.

Deep Reinforcement Learning, Q-Learning, +2

Efficient Navigation of Colloidal Robots in an Unknown Environment via Deep Reinforcement Learning

no code implementations • 26 Jun 2019 • Yuguang Yang, Michael A. Bevan, Bo Li

Equipping active colloidal robots with intelligence such that they can efficiently navigate in unknown complex environments could dramatically impact their use in emerging applications like precision surgery and targeted drug delivery.

Deep Reinforcement Learning, Navigate, +1
