no code implementations • 24 Jan 2025 • Tianrui Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Haoyu Wang, Zikang Huang, Yu Jiang, Xiaobao Wang, Xie Chen, Longbiao Wang, Jianwu Dang
To address these challenges, we propose a characteristic-specific partial fine-tuning strategy, CSP-FT for short. First, we use a weighted-sum approach to analyze the contributions of different Transformer layers in a pre-trained codec language TTS model to emotion and speaker control in the generated speech.
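The weighted-sum analysis is not detailed in this snippet; a minimal sketch of the general technique (learnable softmax weights over frozen layer outputs, in the style of SUPERB probing) might look like the following, where layer count, dimensions, and class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class WeightedSumProbe(nn.Module):
    """Learn one scalar weight per Transformer layer; after training a
    downstream probe (e.g., emotion or speaker classification), the softmax
    weights indicate how much each layer contributes to that characteristic."""

    def __init__(self, num_layers: int, dim: int, num_classes: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, dim) from a frozen backbone
        w = torch.softmax(self.layer_logits, dim=0)            # (num_layers,)
        fused = torch.einsum("l,lbtd->btd", w, hidden_states)  # weighted sum
        return self.head(fused.mean(dim=1))                    # pool over time

probe = WeightedSumProbe(num_layers=12, dim=768, num_classes=4)
states = torch.randn(12, 2, 50, 768)  # dummy layer outputs
logits = probe(states)
print(torch.softmax(probe.layer_logits, dim=0))  # per-layer contribution
```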
no code implementations • 21 Dec 2024 • Jiahui Zhao, Hao Shi, Chenrui Cui, Tianrui Wang, Hexin Liu, Zhaoheng Ni, Lingxuan Ye, Longbiao Wang
In this paper, we adapt Whisper, a large-scale multilingual pre-trained speech recognition model, to code-switching (CS) from both the encoder and the decoder sides.
Automatic Speech Recognition (ASR), +3
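The snippet does not say how each side is adapted; one common, lightweight way to touch both the encoder and the decoder of Whisper is LoRA via Hugging Face PEFT. This is a sketch of that generic recipe, not the paper's configuration; the checkpoint and hyperparameters are illustrative assumptions:

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Inject low-rank adapters into the attention projections of *both* the
# encoder and the decoder, so code-switching data can adapt each side.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapters are trainable
```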
1 code implementation • 21 Dec 2024 • Junyu Wang, Zizhen Lin, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang
Experimental results on the VCTK+DEMAND dataset indicate that Mamba-SEUNet attains a PESQ score of 3.59 while maintaining low computational complexity.
Ranked #2 on Speech Enhancement on VoiceBank + DEMAND
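For reference, wide-band PESQ scores like the 3.59 reported above are typically computed with the `pesq` package; a minimal evaluation sketch (file paths are placeholders) might be:

```python
import soundfile as sf
from pesq import pesq  # pip install pesq

ref, sr = sf.read("clean.wav")     # reference clean speech, 16 kHz
deg, _ = sf.read("enhanced.wav")   # enhanced output to evaluate

# 'wb' = wide-band PESQ (ITU-T P.862.2), the variant usually reported
# on VoiceBank+DEMAND; scores range roughly from -0.5 to 4.5.
score = pesq(sr, ref, deg, "wb")
print(f"PESQ: {score:.2f}")
```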
no code implementations • 12 Dec 2024 • Sheng Wu, Xiaobao Wang, Longbiao Wang, Dongxiao He, Jianwu Dang
Multimodal Sentiment Analysis (MSA) is a critical research frontier that seeks to comprehensively understand human emotions by combining text, audio, and visual data.
no code implementations • 31 Aug 2024 • Tianrui Wang, Jin Li, Ziyang Ma, Rui Cao, Xie Chen, Longbiao Wang, Meng Ge, Xiaobao Wang, Yuguang Wang, Jianwu Dang, Nyima Tashi
In this way, we can progressively extract pitch variation, speaker, and content representations from the input speech.
no code implementations • 11 Aug 2024 • Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, JianHua Tao
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality.
Automatic Speech Recognition (ASR), +4
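A frame-level (rather than utterance-level) cross-modal objective can be sketched as an InfoNCE loss over time-aligned text and speech features; this is a generic illustration under the assumption of pre-aligned frames, not the paper's actual loss:

```python
import torch
import torch.nn.functional as F

def frame_infonce(speech, text, temperature=0.07):
    """speech, text: (time, dim) frame-aligned features from two modalities.
    Each speech frame should match its own text frame (positive) and be
    pushed away from all other frames (negatives)."""
    s = F.normalize(speech, dim=-1)
    t = F.normalize(text, dim=-1)
    logits = s @ t.T / temperature     # (time, time) similarity matrix
    targets = torch.arange(s.size(0))  # the diagonal holds the positives
    return F.cross_entropy(logits, targets)

loss = frame_infonce(torch.randn(50, 256), torch.randn(50, 256))
```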
no code implementations • 29 Jun 2024 • Yuchun Shu, Bo Hu, Yifeng He, Hao Shi, Longbiao Wang, Jianwu Dang
The goal of speech error correction is to accurately locate the wrong words in an automatic speech recognition (ASR) hypothesis and recover them in a well-founded way.
Automatic Speech Recognition (ASR), +1
1 code implementation • 13 Jun 2024 • Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi
Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks.
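Extracting layer-wise SSL representations from a massively multilingual model is straightforward with `transformers`; the checkpoint below (XLS-R, pre-trained on 128 languages) is an example choice, not necessarily the one used in the paper:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

name = "facebook/wav2vec2-xls-r-300m"  # example multilingual checkpoint
extractor = AutoFeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

wav = np.random.randn(16000).astype("float32")  # 1 s of dummy 16 kHz audio
inputs = extractor(wav, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
# One (batch, frames, dim) tensor per Transformer layer; these layer-wise
# SSL features can feed a low-resource TTS front end.
print(len(out.hidden_states), out.hidden_states[-1].shape)
```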
no code implementations • 12 Apr 2024 • Sheng Wu, Jiaxing Liu, Longbiao Wang, Dongxiao He, Xiaobao Wang, Jianwu Dang
On the other hand, the Modality Interaction Network performs interaction fusion of extracted inter-modal features and intra-modal features.
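The network itself is not shown in this snippet; a generic sketch of inter-modal interaction via cross-attention, with the interacted and intra-modal features concatenated afterward, could look like this (dimensions and pooling are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Generic inter-modal interaction: each modality attends to the other,
    then interacted features are concatenated with the intra-modal ones."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, audio):
        # text: (batch, Lt, dim), audio: (batch, La, dim)
        t_ia, _ = self.t2a(text, audio, audio)  # text queries audio
        a_it, _ = self.a2t(audio, text, text)   # audio queries text
        return torch.cat([t_ia.mean(1), a_it.mean(1),
                          text.mean(1), audio.mean(1)], dim=-1)

fusion = CrossModalFusion()
out = fusion(torch.randn(2, 20, 256), torch.randn(2, 50, 256))
print(out.shape)  # (2, 1024)
```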
no code implementations • 7 Jan 2024 • He Wang, Pengcheng Guo, Yue Li, Ao Zhang, Jiayao Sun, Lei Xie, Wei Chen, Pan Zhou, Hui Bu, Xin Xu, BinBin Zhang, Zhuo Chen, Jian Wu, Longbiao Wang, Eng Siong Chng, Sun Li
To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge.
Automatic Speech Recognition (ASR), +3
1 code implementation • 18 Dec 2023 • Rui Cao, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang
By bridging speech enhancement and the Information Bottleneck principle in this letter, we rethink a universal plug-and-play strategy and propose the Refining Underlying Information (RUI) framework to meet these challenges in both theory and practice.
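The Information Bottleneck trade-off is commonly instantiated as a variational objective; a minimal sketch of that generic formulation (not RUI's exact loss) is:

```python
import torch
import torch.nn.functional as F

def vib_loss(logits, targets, mu, logvar, beta=1e-3):
    """Variational Information Bottleneck objective: keep task-relevant
    information (cross-entropy term) while compressing the latent
    representation toward a standard normal prior (KL term).
    `beta` trades off compression against task accuracy."""
    task = F.cross_entropy(logits, targets)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return task + beta * kl
```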
no code implementations • 27 Sep 2023 • Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang, Jianwu Dang
To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method, where all modules are constructed based on the diffusion models.
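Diffusion-based modules of this kind are usually trained with the standard DDPM noise-prediction objective; the sketch below shows that generic training step (`model` is any placeholder network taking a noisy input and a timestep), not the paper's specific architecture:

```python
import torch
import torch.nn.functional as F

def diffusion_step(model, x0, alphas_cumprod):
    """One DDPM-style training step: corrupt clean features x0 with noise at
    a random timestep and train the model to predict that noise."""
    b = x0.size(0)
    t = torch.randint(0, len(alphas_cumprod), (b,))
    a = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward (noising) process
    return F.mse_loss(model(xt, t), noise)       # noise-prediction objective
```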
no code implementations • 1 Sep 2023 • Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang
However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR tasks.
no code implementations • 28 Jul 2023 • Chunyu Qiang, Hao Li, Hao Ni, He Qu, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang
However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging problem caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion problems of existing semantic encoding methods.
no code implementations • 18 May 2023 • Yanjie Fu, Meng Ge, Honglong Wang, Nan Li, Haoran Yin, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang
Recently, neural beamformers have achieved impressive improvements in multi-channel speech separation when direction information is available.
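To see how direction information is exploited, consider the classic (non-neural) delay-and-sum baseline: given a DOA, each channel is phase-aligned toward the source before averaging. A minimal sketch, assuming a uniform linear array and far-field propagation:

```python
import numpy as np

def delay_and_sum(stft, doa_deg, mic_pos, freqs, c=343.0):
    """stft: (mics, freq, time) multi-channel spectrogram.
    Steer a uniform linear array toward doa_deg and average the aligned
    channels; a classic baseline for direction-informed separation."""
    doa = np.deg2rad(doa_deg)
    delays = mic_pos * np.cos(doa) / c  # per-mic delay in seconds
    steer = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    return np.mean(np.conj(steer)[:, :, None] * stft, axis=0)

# 4-mic linear array, 5 cm spacing, steered to 60 degrees
mics = np.arange(4) * 0.05
freqs = np.linspace(0, 8000, 257)
out = delay_and_sum(np.random.randn(4, 257, 100) + 0j, 60, mics, freqs)
```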
1 code implementation • 22 Feb 2023 • Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang
Visual speech (i.e., lip motion) is highly correlated with auditory speech due to their co-occurrence and synchronization in speech production.
no code implementations • 7 Dec 2022 • Yanjie Fu, Haoran Yin, Meng Ge, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang
Recently, many deep learning based beamformers have been proposed for multi-channel speech separation.
no code implementations • 2 Nov 2022 • Kong Aik Lee, Tomi Kinnunen, Daniele Colibro, Claudio Vair, Andreas Nautsch, Hanwu Sun, Liang He, Tianyu Liang, Qiongqiong Wang, Mickael Rouvier, Pierre-Michel Bousquet, Rohan Kumar Das, Ignacio Viñals Bailo, Meng Liu, Héctor Deldago, Xuechen Liu, Md Sahidullah, Sandro Cumani, Boning Zhang, Koji Okabe, Hitoshi Yamamoto, Ruijie Tao, Haizhou Li, Alfonso Ortega Giménez, Longbiao Wang, Luis Buera
This manuscript describes the I4U submission to the 2020 NIST Speaker Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS) Challenge.
no code implementations • 2 Nov 2022 • Tongtong Song, Qiang Xu, Haoyu Lu, Longbiao Wang, Hao Shi, Yuqin Lin, Yanbing Yang, Jianwu Dang
It has two stages: the speech awareness (SA) stage and the language fusion (LF) stage.
Automatic Speech Recognition (ASR), +1
no code implementations • 11 Oct 2022 • Xiaohui Liu, Meng Liu, Lin Zhang, Linjuan Zhang, Chang Zeng, Kai Li, Nan Li, Kong Aik Lee, Longbiao Wang, Jianwu Dang
The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech.
no code implementations • 9 Oct 2022 • Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
1 code implementation • 15 Jul 2022 • Haoran Yin, Meng Ge, Yanjie Fu, Gaoyan Zhang, Longbiao Wang, Lei Zhang, Lin Qiu, Jianwu Dang
These algorithms are usually achieved by mapping the multi-channel audio input to a single output (i.e., the overall spatial pseudo-spectrum (SPS) of all sources), which is called MISO.
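A spatial pseudo-spectrum assigns one energy value to each candidate direction, with sources appearing as peaks. A minimal SRP-PHAT-style sketch, assuming a linear array (a simplifying assumption, not the paper's setup):

```python
import numpy as np

def srp_sps(stft, mic_pos, freqs, angles_deg, c=343.0):
    """Spatial pseudo-spectrum via steered response power:
    stft: (mics, freq, time); returns one power value per candidate angle."""
    x = stft / (np.abs(stft) + 1e-8)  # PHAT whitening
    sps = []
    for ang in np.deg2rad(angles_deg):
        delays = mic_pos * np.cos(ang) / c
        steer = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        y = np.sum(steer[:, :, None] * x, axis=0)  # steer and sum channels
        sps.append(np.mean(np.abs(y) ** 2))        # steered response power
    return np.array(sps)                           # (num_angles,)

sps = srp_sps(np.random.randn(4, 257, 50) + 0j,
              np.arange(4) * 0.05, np.linspace(0, 8000, 257),
              np.arange(0, 181, 5))
print(int(np.argmax(sps) * 5), "deg")  # estimated DOA = SPS peak
```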
no code implementations • 29 Jun 2022 • Tongtong Song, Qiang Xu, Meng Ge, Longbiao Wang, Hao Shi, Yongjie Lv, Yuqin Lin, Jianwu Dang
The dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition.
2 code implementations • 24 Jun 2022 • Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang
Sound source localization aims to seek the direction of arrival (DOA) of all sound sources from the observed multi-channel audio.
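A classic building block for DOA estimation is GCC-PHAT, which estimates the time difference of arrival (TDOA) between a microphone pair; combining pairwise TDOAs across an array yields the DOA. A self-contained sketch:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the TDOA between two microphone signals with GCC-PHAT."""
    n = sig.size + ref.size
    S = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(S / (np.abs(S) + 1e-8), n=n)  # PHAT-weighted correlation
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs
    return tau  # seconds; positive if `sig` lags `ref`

fs = 16000
ref = np.random.randn(fs)
sig = np.roll(ref, 8)               # simulate an 8-sample delay
print(gcc_phat(sig, ref, fs) * fs)  # ~8.0
```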
no code implementations • 27 Apr 2022 • Sen Chen, Zhilei Liu, Jiaxing Liu, Longbiao Wang
We utilize a pre-trained AU classifier to ensure that the generated images contain correct AU information.
1 code implementation • 21 Feb 2022 • Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li
Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance.
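The usual recipe conditions a mask estimator on a speaker embedding derived from the reference utterance. A toy sketch of that generic pattern (architecture and dimensions are illustrative, not the paper's model):

```python
import torch
import torch.nn as nn

class SpeakerExtractor(nn.Module):
    """Toy speaker extraction: a speaker encoder embeds the reference
    utterance, and the embedding conditions a mask over the mixture."""

    def __init__(self, dim=257, emb=128):
        super().__init__()
        self.spk_enc = nn.GRU(dim, emb, batch_first=True)
        self.mask_net = nn.Sequential(nn.Linear(dim + emb, 512), nn.ReLU(),
                                      nn.Linear(512, dim), nn.Sigmoid())

    def forward(self, mixture, reference):
        # mixture, reference: (batch, time, freq) magnitude spectrograms
        _, h = self.spk_enc(reference)  # final hidden state: (1, batch, emb)
        spk = h[-1].unsqueeze(1).expand(-1, mixture.size(1), -1)
        mask = self.mask_net(torch.cat([mixture, spk], dim=-1))
        return mask * mixture           # extracted target speech

model = SpeakerExtractor()
out = model(torch.randn(2, 100, 257), torch.randn(2, 80, 257))
```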
no code implementations • 19 Oct 2021 • Sen Chen, Zhilei Liu, Jiaxing Liu, Zhengxiang Yan, Longbiao Wang
Quantitative and qualitative experiments demonstrate that our method outperforms existing methods in both image quality and lip-sync accuracy.
no code implementations • 9 Oct 2021 • Cheng Gong, Longbiao Wang, ZhenHua Ling, Ju Zhang, Jianwu Dang
An end-to-end speech synthesis model can directly take an utterance as reference audio and generate speech from text with prosody and speaker characteristics similar to the reference audio.
1 code implementation • 17 Apr 2021 • Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Chang Zeng, Jianwu Dang
Audio-visual (AV) lip biometrics is a promising authentication technique that leverages the benefits of both the audio and visual modalities in speech communication.
no code implementations • 19 Nov 2020 • Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li
Speaker extraction requires a sample speech from the target speaker as the reference.
no code implementations • 10 May 2020 • Meng Ge, Cheng-Lin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li
To eliminate such mismatch, we propose a complete time-domain speaker extraction solution, called SpEx+.
Ranked #1 on Speech Extraction on WSJ0-2mix-extr
Speech Extraction, Audio and Speech Processing, Sound
no code implementations • 2 May 2020 • Qiang Yu, Shenglan Li, Huajin Tang, Longbiao Wang, Jianwu Dang, Kay Chen Tan
They are also believed to play an essential role in the low power consumption of biological systems, whose efficiency attracts increasing attention to the field of neuromorphic computing.
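The sparsity behind these low-power claims is easy to see with a leaky integrate-and-fire (LIF) neuron, the standard spiking-neuron model: computation happens only at rare, binary spike events. A minimal simulation sketch:

```python
import numpy as np

def lif_neuron(current, dt=1e-3, tau=0.02, v_th=1.0, v_reset=0.0):
    """Leaky integrate-and-fire neuron: the membrane potential leaks toward
    rest, integrates input current, and emits a binary spike on threshold
    crossing."""
    v, spikes = 0.0, []
    for i in current:
        v += dt / tau * (-v + i)  # leaky integration
        if v >= v_th:
            spikes.append(1)
            v = v_reset           # reset after firing
        else:
            spikes.append(0)
    return np.array(spikes)

spikes = lif_neuron(np.random.uniform(0, 3, 200))
print(spikes.sum(), "spikes in 200 steps")  # sparse, event-driven output
```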
no code implementations • IJCNLP 2019 • Jinxin Chang, Ruifang He, Longbiao Wang, Xiangyu Zhao, Ting Yang, Ruifang Wang
However, the information sampled from the latent space usually becomes useless due to the KL-divergence vanishing issue, and the highly abstractive global variables easily dilute the personal features of the replier, leading to non-replier-specific responses.
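A standard mitigation for KL vanishing (one of several, e.g. cyclical schedules or free bits, and not necessarily the paper's choice) is to anneal the KL weight so the decoder must rely on the latent variable early in training:

```python
def kl_weight(step, warmup=10000):
    """Linear KL annealing: start the KL term near zero, then ramp its
    weight up to 1.0 over `warmup` training steps."""
    return min(1.0, step / warmup)

# total_loss = reconstruction_loss + kl_weight(step) * kl_divergence
```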
no code implementations • 23 Oct 2019 • Zhilei Liu, Jiahui Dong, Cuicui Zhang, Longbiao Wang, Jianwu Dang
Most existing AU detection works that consider AU relationships rely on probabilistic graphical models with manually extracted features.
no code implementations • 4 Feb 2019 • Qiang Yu, Yanli Yao, Longbiao Wang, Huajin Tang, Jianwu Dang, Kay Chen Tan
Our framework is a unifying system with a consistent integration of three major functional parts which are sparse encoding, efficient learning and robust readout.
no code implementations • COLING 2018 • Fengyu Guo, Ruifang He, Di Jin, Jianwu Dang, Longbiao Wang, Xiangang Li
In this paper, we propose a novel neural Tensor network framework with Interactive Attention and Sparse Learning (TIASL) for implicit discourse relation recognition.
no code implementations • COLING 2018 • Ruifang He, Xuefei Zhang, Di Jin, Longbiao Wang, Jianwu Dang, Xiangang Li
They ignore that one discusses diverse topics when dynamically interacting with different people.
no code implementations • 21 Mar 2018 • Haotian Guan, Zhilei Liu, Longbiao Wang, Jianwu Dang, Ruiguo Yu
Recently, increasing attention has been directed to the study of speech emotion recognition, in which global acoustic features of an utterance are mostly used to eliminate content differences.
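"Global" features here means utterance-level statistics of frame-wise descriptors, which discard content-dependent detail. A minimal sketch with librosa (the feature set is an illustrative assumption, not the paper's):

```python
import numpy as np
import librosa

def global_features(path):
    """Utterance-level acoustic features of the kind commonly used in
    speech emotion recognition: frame-wise descriptors (MFCC, pitch,
    energy) reduced to per-utterance statistics."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, frames)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)       # pitch track
    rms = librosa.feature.rms(y=y)[0]                   # frame energy
    feats = [mfcc.mean(1), mfcc.std(1),
             [np.nanmean(f0), np.nanstd(f0), rms.mean(), rms.std()]]
    return np.concatenate([np.atleast_1d(f) for f in feats])
```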