Search Results for author: Zhihao Du

Found 22 papers, 12 papers with code

SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer

no code implementations16 Feb 2025 Zhengyan Sheng, Zhihao Du, Shiliang Zhang, Zhijie Yan, Yexin Yang, ZhenHua Ling

This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech, facilitating seamless interaction with large language models.

Text to Speech

UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation

no code implementations11 Jan 2025 Zhengyan Sheng, Zhihao Du, Heng Lu, Shiliang Zhang, Zhen-Hua Ling

Recent advancements in personalized speech generation have brought synthetic speech increasingly close to the realism of target speakers' recordings, yet multimodal speaker generation is still an emerging area.

Diversity

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

1 code implementation13 Dec 2024 Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, Jingren Zhou

By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode.

In-Context Learning, Quantization, +1

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

1 code implementation23 Oct 2024 Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chaohong Tan, Zhihao Du, Shiliang Zhang

However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech.

Large Language Model, Spoken Dialogue Systems

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

no code implementations22 Oct 2024 Guanrou Yang, Fan Yu, Ziyang Ma, Zhihao Du, Zhifu Gao, Shiliang Zhang, Xie Chen

While automatic speech recognition (ASR) systems have achieved remarkable performance with large-scale datasets, their efficacy remains inadequate in low-resource settings, encompassing dialects, accents, minority languages, and long-tail hotwords, domains with significant practical relevance.

Automatic Speech Recognition, Automatic Speech Recognition (ASR), +4

IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities

no code implementations9 Oct 2024 Xin Zhang, Xiang Lyu, Zhihao Du, Qian Chen, Dong Zhang, Hangrui Hu, Chaohong Tan, Tianyu Zhao, Yuxuan Wang, Bin Zhang, Heng Lu, Yaqian Zhou, Xipeng Qiu

Current methods of building LLMs with voice interaction capabilities rely heavily on explicit text autoregressive generation before or during speech response generation to maintain content quality, which unfortunately brings computational overhead and increases latency in multi-turn interactions.

Response Generation

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

no code implementations7 Jul 2024 Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
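The second stage named here, conditional flow matching, trains a network to predict the velocity along a path from noise to the target speech features. A minimal pure-Python sketch of the training target (illustrative only; the straight-line interpolant below is the standard optimal-transport variant, not necessarily CosyVoice's exact formulation):

```python
import random

def cfm_training_pair(x0, x1, t):
    """Optimal-transport conditional flow matching:
    interpolate x_t = (1 - t) * x0 + t * x1 and regress the
    constant velocity u = x1 - x0 at that point."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    u = [b - a for a, b in zip(x0, x1)]
    return x_t, u

random.seed(1)
noise = [random.gauss(0, 1) for _ in range(4)]  # x0 ~ N(0, I)
mel = [0.3, -1.2, 0.8, 0.1]                     # toy target feature frame
x_t, u = cfm_training_pair(noise, mel, t=0.5)

# At t = 1 the interpolant reaches the data point exactly.
x_1, _ = cfm_training_pair(noise, mel, t=1.0)
print(x_1 == mel)  # True
```

At inference time the learned velocity field is integrated from t = 0 to t = 1, turning noise into speech features conditioned on the LLM's tokens.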

Language Modelling, Large Language Model, +6

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

2 code implementations13 Feb 2024 Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, JiaMing Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen

We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task.
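The paper's point, that a single trainable linear layer is enough to bridge a frozen speech encoder and a frozen LLM, reduces to a projection like the following pure-Python sketch (dimensions and values are illustrative, not taken from the paper):

```python
import random

def linear_projector(frames, weight, bias):
    """Map speech-encoder frames (dim d_in) into the LLM embedding
    space (dim d_out). In this setup the projector is the only
    trainable component; encoder and LLM stay frozen."""
    out = []
    for frame in frames:
        row = [sum(w * x for w, x in zip(col, frame)) + b
               for col, b in zip(weight, bias)]
        out.append(row)
    return out

d_in, d_out = 4, 6  # toy sizes; real models project e.g. 1024 -> 4096
random.seed(0)
weight = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
bias = [0.0] * d_out
frames = [[1.0] * d_in, [0.5] * d_in]  # two dummy encoder output frames
embedded = linear_projector(frames, weight, bias)
print(len(embedded), len(embedded[0]))  # 2 6
```

The projected frames are then fed to the LLM as prefix embeddings alongside the text prompt, so only the weight and bias above need gradient updates.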

Automatic Speech Recognition, Automatic Speech Recognition (ASR), +2

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

2 code implementations7 Oct 2023 Zhihao Du, JiaMing Wang, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang

Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement over models using continuous speech features.

Audio captioning, Automatic Speech Recognition, +13

FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec

1 code implementation14 Sep 2023 Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng

We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis.

Automatic Speech Recognition, speech-recognition, +4

CASA-ASR: Context-Aware Speaker-Attributed ASR

no code implementations21 May 2023 Mohan Shi, Zhihao Du, Qian Chen, Fan Yu, Yangze Li, Shiliang Zhang, Jie Zhang, Li-Rong Dai

In addition, a two-pass decoding strategy is further proposed to fully leverage the contextual modeling ability, resulting in better recognition performance.

Automatic Speech Recognition, speech-recognition, +1

FunASR: A Fundamental End-to-End Speech Recognition Toolkit

1 code implementation18 May 2023 Zhifu Gao, Zerui Li, JiaMing Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, Shiliang Zhang

FunASR offers models trained on large-scale industrial corpora and the ability to deploy them in applications.

 Ranked #1 on Speech Recognition on WenetSpeech (using extra training data)

Action Detection, Activity Detection, +2

TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization

1 code implementation8 Mar 2023 JiaMing Wang, Zhihao Du, Shiliang Zhang

Recently, end-to-end neural diarization (EEND) has been introduced and achieves promising results in speaker-overlapped scenarios.

speaker-diarization, Speaker Diarization, +1

A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings

no code implementations31 Mar 2022 Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie

Therefore, we propose the second approach, WD-SOT, to address alignment errors by introducing a word-level diarization model, which can get rid of such timestamp alignment dependency.

Automatic Speech Recognition, Automatic Speech Recognition (ASR), +2

Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios

1 code implementation18 Mar 2022 Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan

Through this formulation, we propose the speaker embedding-aware neural diarization (SEND) framework, where a speech encoder, a speaker encoder, two similarity scorers, and a post-processing network are jointly optimized to predict the encoded labels according to the similarities between speech features and speaker embeddings.
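The core idea, scoring each speech frame against each enrolled speaker embedding and deriving per-speaker activity from the similarities, can be sketched in pure Python (illustrative only; in SEND the encoders and similarity scorers are jointly learned networks, not fixed cosine similarity):

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def score_frames(frames, speaker_embeddings, threshold=0.5):
    """Mark every speaker whose similarity to a frame exceeds the
    threshold as active; overlapped frames may activate several
    speakers at once."""
    return [[int(cosine(f, spk) > threshold) for spk in speaker_embeddings]
            for f in frames]

speakers = [[1.0, 0.0], [0.0, 1.0]]            # two toy speaker embeddings
frames = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]  # third frame overlaps both
print(score_frames(frames, speakers))  # [[1, 0], [0, 1], [1, 1]]
```

The post-processing network in the actual framework then smooths such raw per-frame decisions into coherent speaker segments.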

Action Detection, Activity Detection, +3

Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information

2 code implementations28 Nov 2021 Zhihao Du, Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei

In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with power set.
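The power-set encoding can be illustrated with a short sketch: every subset of active speakers becomes one class, so overlapped frames get a single label instead of multiple binary ones (the enumeration order below is an assumption for illustration, not the paper's exact scheme):

```python
from itertools import combinations

def powerset_classes(num_speakers):
    """Enumerate all subsets of speakers; each subset becomes one class,
    so multi-label frame classification turns into single-label."""
    classes = []
    for k in range(num_speakers + 1):
        classes.extend(combinations(range(num_speakers), k))
    return {subset: idx for idx, subset in enumerate(classes)}

# With 3 speakers there are 2**3 = 8 classes, covering silence,
# single-speaker, and overlapped-speech frames alike.
mapping = powerset_classes(3)
print(len(mapping))     # 8
print(mapping[()])      # 0 -> the silence class
print(mapping[(0, 2)])  # the class for speakers 0 and 2 overlapping
```

A softmax over these power-set classes then replaces the per-speaker sigmoid outputs of conventional multi-label diarization.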

Action Detection, Activity Detection, +3

Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events

1 code implementation10 Apr 2019 Hongwei Song, Jiqing Han, Shiwen Deng, Zhihao Du

In this paper, we propose a new strategy for acoustic scene classification (ASC), namely recognizing acoustic scenes through identifying distinct sound events.

Acoustic Scene Classification, Classification, +2
