no code implementations • 28 Oct 2024 • Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Ziyang Ma, Shengpeng Ji, Jialong Zuo, Tao Jin, Zhou Zhao
The scaling up has brought tremendous success in the fields of vision and language in recent years.
1 code implementation • 23 Oct 2024 • Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chaohong Tan
However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech.
no code implementations • 16 Oct 2024 • RuiQi Li, Siqi Zheng, Xize Cheng, Ziang Zhang, Shengpeng Ji, Zhou Zhao
Generating music that aligns with the visual content of a video has been a challenging task, as it requires a deep understanding of visual semantics and involves generating music whose melody, rhythm, and dynamics harmonize with the visual narratives.
1 code implementation • 20 Sep 2024 • Han Yin, Jisheng Bai, Yang Xiao, Hui Wang, Siqi Zheng, Yafeng Chen, Rohan Kumar Das, Chong Deng, Jianfeng Chen
To address this issue, we propose the text-queried SED (TQ-SED) framework.
1 code implementation • 29 Aug 2024 • Shengpeng Ji, Ziyue Jiang, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Xize Cheng, Zehan Wang, RuiQi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, Zhou Zhao
Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
no code implementations • 22 Aug 2024 • Luyao Cheng, Hui Wang, Siqi Zheng, Yafeng Chen, Rongjie Huang, Qinglin Zhang, Qian Chen, Xihao Li
Then we introduce a joint pairwise constraint propagation algorithm to cluster speakers based on these visual and semantic constraints.
no code implementations • 7 Jul 2024 • Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
2 code implementations • 4 Jul 2024 • Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, Zhangyu Xiao, Zhijie Yan, Yexin Yang, Bin Zhang, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Siqi Zheng
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs).
no code implementations • 2 Jul 2024 • RuiQi Li, Zhiqing Hong, Yongqi Wang, Lichao Zhang, Rongjie Huang, Siqi Zheng, Zhou Zhao
Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices.
2 code implementations • 17 Jun 2024 • Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Wen Wang
SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views.
no code implementations • 17 Jun 2024 • Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Hai Yu, Jiaqing Liu, Yukun Ma, Chong Zhang
The Transformer architecture has significantly advanced deep learning, particularly in natural language processing, by effectively managing long-range dependencies.
1 code implementation • 4 Jun 2024 • Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Junjie Li
Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings.
1 code implementation • 3 Jun 2024 • Shengpeng Ji, Jialong Zuo, Wen Wang, Minghui Fang, Siqi Zheng, Qian Chen, Ziyue Jiang, Hai Huang, Zehan Wang, Xize Cheng, Zhou Zhao
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual style description prompt.
2 code implementations • 1 Jun 2024 • Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao
To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver.
2 code implementations • 29 Mar 2024 • Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Tinglong Zhu, Rongjie Huang, Chong Deng, Qian Chen, Shiliang Zhang, Wen Wang, Xihao Li
With 3D-Speaker-Toolkit, we establish a new benchmark for multimodal speaker analysis.
1 code implementation • 19 Feb 2024 • Shengpeng Ji, Minghui Fang, Ziyue Jiang, Siqi Zheng, Qian Chen, Rongjie Huang, Jialung Zuo, Shulei Wang, Zhou Zhao
Furthermore, we also validate the efficiency of the Language-Codec on downstream speech language models.
1 code implementation • 13 Feb 2024 • Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, JiaMing Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
1 code implementation • 8 Nov 2023 • Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Yukun Ma, Hai Yu, Jiaqing Liu, Chong Zhang
We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach.
2 code implementations • 7 Oct 2023 • Zhihao Du, JiaMing Wang, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang
Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement over models using continuous speech features.
no code implementations • 19 Sep 2023 • Luyao Cheng, Siqi Zheng, Qinglin Zhang, Hui Wang, Yafeng Chen, Qian Chen, Shiliang Zhang
Speaker diarization has gained considerable attention within speech processing research community.
1 code implementation • 14 Sep 2023 • Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng
We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis.
2 code implementations • 5 Aug 2023 • Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Wen Wang
To alleviate this problem, we introduce a diversity regularization term to embeddings in SDPN.
no code implementations • 14 Jul 2023 • Qian Chen, Wen Wang, Qinglin Zhang, Chong Deng, Ma Yukun, Siqi Zheng
Transformer-based pre-trained language models, such as BERT, achieve great success in various natural language understanding tasks.
2 code implementations • 27 Jun 2023 • Siqi Zheng, Luyao Cheng, Yafeng Chen, Hui Wang, Qian Chen
Disentangling uncorrelated information in speech utterances is a crucial research topic within speech community.
no code implementations • 22 May 2023 • Luyao Cheng, Siqi Zheng, Zhang Qinglin, Hui Wang, Yafeng Chen, Qian Chen
In this paper, we propose methods to extract speaker-related information from semantic content in multi-party meetings, which, as we will show, can further benefit speaker diarization.
2 code implementations • 22 May 2023 • Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Jiajun Qi
This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve the performance.
1 code implementation • 18 May 2023 • Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Chong Deng, Hai Yu, Jiaqing Liu, Yukun Ma, Chong Zhang
Prior studies diagnose the anisotropy problem in sentence representations from pre-trained language models, e. g., BERT, without fine-tuning.
no code implementations • 14 Dec 2022 • Jinglin Liu, Zhenhui Ye, Qian Chen, Siqi Zheng, Wen Wang, Qinglin Zhang, Zhou Zhao
Recently, binaural audio synthesis (BAS) has emerged as a promising research field for its applications in augmented and virtual realities.
no code implementations • 26 Nov 2022 • Jianhong Tu, Zeyu Cui, Xiaohuan Zhou, Siqi Zheng, Kai Hu, Ju Fan, Chang Zhou
To achieve this task, we construct a synthetic dataset and develop an effective framework.
1 code implementation • 8 Nov 2022 • Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen
A range of experiments conducted on the VoxCeleb datasets demonstrate the superiority of the regularized DINO framework in speaker verification.
no code implementations • 28 May 2022 • Fuchuan Tong, Siqi Zheng, Haodong Zhou, Xingjia Xie, Qingyang Hong, Lin Li
While promising performance for speaker verification has been achieved by deep speaker embeddings, the advantage would reduce in the case of speaking-style variability.
no code implementations • 26 Apr 2022 • Siqi Zheng, Hongbin Suo
In this paper we propose to view clustering-based diarization as a community detection problem.
no code implementations • 25 Apr 2022 • Fuchuan Tong, Siqi Zheng, Min Zhang, Yafeng Chen, Hongbin Suo, Qingyang Hong, Lin Li
In this work, we present a GCN-based approach for semi-supervised learning.
1 code implementation • 18 Mar 2022 • Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan
Through this formulation, we propose the speaker embedding-aware neural diarization (SEND) framework, where a speech encoder, a speaker encoder, two similarity scorers, and a post-processing network are jointly optimized to predict the encoded labels according to the similarities between speech features and speaker embeddings.
Ranked #1 on Speaker Diarization on AliMeeting
2 code implementations • 28 Nov 2021 • Zhihao Du, Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei
In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with power set.
1 code implementation • ICLR 2022 • Chao-Hong Tan, Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Zhen-Hua Ling
We propose a novel Pooling Network (PoNet) for token mixing in long sequences with linear complexity.
no code implementations • 9 Sep 2021 • Siqi Zheng, Shiliang Zhang, Weilong Huang, Qian Chen, Hongbin Suo, Ming Lei, Jinwei Feng, Zhijie Yan
We propose BeamTransformer, an efficient architecture to leverage beamformer's edge in spatial filtering and transformer's capability in context sequence modeling.
no code implementations • 27 Jul 2021 • Yuchen Chai, Juan Palacios, Jianghao Wang, Yichun Fan, Siqi Zheng
COVID-19, as a global health crisis, has triggered the fear emotion with unprecedented intensity.
no code implementations • 20 Jul 2021 • Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng, Zhijie Yan
In this paper we describe a speaker diarization system that enables localization and identification of all speakers present in a conversation or meeting.
no code implementations • 29 May 2021 • Da Zhang, Qingyi Wang, Shaojie Song, Simiao Chen, MingWei Li, Lu Shen, Siqi Zheng, Bofeng Cai, Shenhao Wang
Applications of the framework with Chinese data reveal highly heterogeneous health benefits of reducing fossil fuel use in different sectors and regions in China with a mean of \$34/tCO2 and a standard deviation of \$84/tCO2.