no code implementations • dialdoc (ACL) 2022 • Kun Li, Tianhua Zhang, Liping Tang, Junan Li, Hongyuan Lu, Xixin Wu, Helen Meng
For the response generator, we use grounding span prediction as an auxiliary task to be jointly trained with the main task of response generation.
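The joint-training setup described in this snippet can be sketched as a weighted multi-task objective; the weight `aux_weight` and the toy loss values below are illustrative assumptions, not values from the paper.

```python
def joint_loss(generation_loss: float, span_loss: float, aux_weight: float = 0.5) -> float:
    """Combine the main response-generation loss with the auxiliary
    grounding-span-prediction loss into one training objective.
    The 0.5 weighting is a hypothetical choice for illustration."""
    return generation_loss + aux_weight * span_loss

# Toy illustration: the auxiliary span loss nudges the total objective.
total = joint_loss(generation_loss=2.0, span_loss=1.0, aux_weight=0.5)
print(total)  # 2.5
```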
no code implementations • 11 Jun 2025 • Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li
Audio-visual target speaker extraction (AV-TSE) models primarily rely on target visual cues to isolate the target speaker's voice from others.
no code implementations • 10 Jun 2025 • Yuejiao Wang, Xianmin Gong, Xixin Wu, Patrick Wong, Hoi-lam Helene Fung, Man Wai Mak, Helen Meng
The study demonstrated the potential of the naturalistic language-related fMRI task for early detection of aging-related cognitive decline and NCD.
no code implementations • 28 May 2025 • Kun Li, Yunxiang Li, Tianhua Zhang, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
Robust evaluation is critical for deploying trustworthy retrieval-augmented generation (RAG) systems.
no code implementations • 24 May 2025 • Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, Zhiyong Wu
These speech LLMs (SLLMs) typically use supervised fine-tuning to align speech with text-based LLMs.
no code implementations • 1 Apr 2025 • Wenxuan Wu, Xueyuan Chen, Shuai Wang, Jiadong Wang, Lingwei Meng, Xixin Wu, Helen Meng, Haizhou Li
Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues.
no code implementations • 3 Mar 2025 • Kun Li, Tianhua Zhang, Yunxiang Li, Hongyin Luo, Abdalla Moustafa, Xixin Wu, James Glass, Helen Meng
Improving context faithfulness in large language models is essential for developing trustworthy retrieval-augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios involving knowledge conflicts.
no code implementations • 19 Jan 2025 • Jingran Xie, Shun Lei, Yue Yu, Yang Xiang, Hui Wang, Xixin Wu, Zhiyong Wu
Empathetic dialogue is crucial for natural human-computer interaction, allowing the dialogue system to respond in a more personalized and emotionally aware manner, improving user satisfaction and engagement.
no code implementations • 7 Jan 2025 • Jinchao Li, Yuejiao Wang, Junan Li, Jiawen Kang, Bo Zheng, Simon Wong, Brian Mak, Helene Fung, Jean Woo, Man-Wai Mak, Timothy Kwok, Vincent Mok, Xianmin Gong, Xixin Wu, Xunying Liu, Patrick Wong, Helen Meng
Early detection of neurocognitive disorders (NCDs) is crucial for timely intervention and disease management.
no code implementations • 2 Jan 2025 • Dongyang Dai, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, Helen Meng
The pre-trained BERT model extracts semantic features from a raw Chinese character sequence, and the NN-based classifier predicts the polyphonic character's pronunciation from the BERT output.
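The BERT-plus-classifier pipeline above can be sketched as restricting a classifier's scores to the valid pronunciations of a polyphonic character; the candidate table and the `scores` dictionary standing in for the NN classifier's output are toy assumptions.

```python
# Hypothetical candidate table: each polyphonic character maps to its
# admissible pinyin readings (toy data, not from the paper).
CANDIDATES = {"行": ["xing2", "hang2"]}

def predict_pronunciation(char: str, scores: dict) -> str:
    """Pick the highest-scoring reading among the character's candidates.
    `scores` stands in for the NN classifier's output over BERT features."""
    return max(CANDIDATES[char], key=lambda p: scores.get(p, float("-inf")))

print(predict_pronunciation("行", {"xing2": 0.2, "hang2": 0.7}))  # hang2
```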
no code implementations • 2 Jan 2025 • Dongyang Dai, Zhiyong Wu, Runnan Li, Xixin Wu, Jia Jia, Helen Meng
Identifying the emotional state from speech is essential for the natural interaction of the machine with the speaker.
no code implementations • 30 Dec 2024 • Xiaohan Feng, Xixin Wu, Helen Meng
We propose an ontology-grounded approach to Knowledge Graph (KG) construction using Large Language Models (LLMs) on a knowledge base.
no code implementations • 9 Dec 2024 • Jiawen Kang, Junan Li, Jinchao Li, Xixin Wu, Helen Meng
Automatic Speech Recognition (ASR) plays an important role in speech-based automatic detection of Alzheimer's disease (AD).
no code implementations • 28 Nov 2024 • Junan Li, Yunxiang Li, Yuren Wang, Xixin Wu, Helen Meng
Alzheimer's disease (AD) has become one of the most significant health challenges in an aging society.
no code implementations • 12 Nov 2024 • Dongrui Han, Mingyu Cui, Jiawen Kang, Xixin Wu, Xunying Liu, Helen Meng
Using GPT-4 in the ICKR system yields an increase of 3.5% absolute (3.8% relative) on the Librig2p dataset.
no code implementations • 24 Oct 2024 • Kun Li, Tianhua Zhang, Xixin Wu, Hongyin Luo, James Glass, Helen Meng
We argue that this concept can serve as a principle for making faithful and sound reasoning for KGQA.
2 code implementations • 27 Sep 2024 • Siheng Li, Cheng Yang, Taiqiang Wu, Chufan Shi, Yuji Zhang, Xinyu Zhu, Zesen Cheng, Deng Cai, Mo Yu, Lemao Liu, Jie zhou, Yujiu Yang, Ngai Wong, Xixin Wu, Wai Lam
Honesty is a fundamental principle for aligning large language models (LLMs) with human values, requiring these models to recognize what they know and don't know and be able to faithfully express their knowledge.
no code implementations • 22 Sep 2024 • Jiawen Kang, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng
Alzheimer's Disease (AD) detection employs machine learning classification models to distinguish between individuals with AD and those without.
1 code implementation • 19 Sep 2024 • Jiawen Kang, Lingwei Meng, Mingyu Cui, Yuejiao Wang, Xixin Wu, Xunying Liu, Helen Meng
SACTC is a CTC variant tailored for multi-talker scenarios; it explicitly models speaker disentanglement by constraining the encoder to represent different speakers' tokens at specific time frames.
no code implementations • 19 Sep 2024 • Yuanyuan Wang, Hangting Chen, Dongchao Yang, Zhiyong Wu, Xixin Wu
Current Text-to-audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders models from generating audio with fine-grained control of content and style.
1 code implementation • 13 Sep 2024 • Lingwei Meng, Shujie Hu, Jiawen Kang, Zhaoqing Li, Yuejiao Wang, Wenxuan Wu, Xixin Wu, Xunying Liu, Helen Meng
Recent advancements in large language models (LLMs) have revolutionized various domains, bringing significant progress and new opportunities.
Automatic Speech Recognition (ASR) (+4 more)
no code implementations • 25 Aug 2024 • Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng
With these improvements, we demonstrate significant gains in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
no code implementations • 18 Jul 2024 • Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng
Moreover, fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech. Experimental results show that our proposed method significantly outperforms the baseline methods in terms of prosody naturalness and spontaneous behavior naturalness.
no code implementations • 15 Jul 2024 • Yuejiao Wang, Xianmin Gong, Lingwei Meng, Xixin Wu, Helen Meng
This study highlights the potential of fMRI encoding models and brain scores for detecting early functional changes in NCD patients.
1 code implementation • 13 Jul 2024 • Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng
In this study, we propose a pioneering approach to empower Whisper, which is a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks.
no code implementations • 11 Jul 2024 • Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei
We present MELLE, a novel continuous-valued token based language modeling approach for text-to-speech synthesis (TTS).
no code implementations • 1 Jul 2024 • Jingyan Zhou, Kun Li, Junan Li, Jiawen Kang, Minda Hu, Xixin Wu, Helen Meng
In PAD, we automatically collect conversational data that cover the vulnerabilities of an LLM around specific safety risks in a self-play manner, where the attacker aims to elicit unsafe responses and the defender generates safe responses to these attacks.
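The self-play collection loop described above can be sketched as follows; `attacker` and `defender` stand in for the LLM calls and are toy stubs here (assumptions for illustration, not the paper's prompts or models).

```python
def attacker(risk: str) -> str:
    """Toy stub for the attacker LLM: emits a probe around a safety risk."""
    return f"adversarial prompt targeting {risk}"

def defender(attack: str) -> str:
    """Toy stub for the defender LLM: emits a safe response to the attack."""
    return f"safe response to: {attack}"

def self_play(risks, rounds=1):
    """Collect (attack, response) conversational pairs around specific
    safety risks, one pair per round per risk."""
    data = []
    for risk in risks:
        for _ in range(rounds):
            atk = attacker(risk)
            data.append((atk, defender(atk)))
    return data

print(len(self_play(["privacy", "toxicity"], rounds=2)))  # 4
```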
no code implementations • 20 Jun 2024 • Jing Xu, Minglin Wu, Xixin Wu, Helen Meng
Experiments show that our adaptation methods enable mHuBERT to be applied to a new language (Mandarin), with MOS increased by about 1.6 and relative WER reduced by up to 61.72%.
no code implementations • 16 Jun 2024 • Tianhua Zhang, Kun Li, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
A novel approach is then proposed to assess the retriever's preference for these candidates via the probability of answers conditioned on the conversational query, marginalized over the Top-$K$ passages.
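The marginalization over Top-K passages can be written as P(answer | query) ≈ Σₖ P(passageₖ | query) · P(answer | query, passageₖ); a minimal sketch with made-up probabilities (the real terms would come from the retriever and the reader model):

```python
def marginalized_answer_prob(passage_probs, answer_probs):
    """P(answer | query) marginalized over the retriever's Top-K passages:
    sum over k of P(passage_k | query) * P(answer | query, passage_k)."""
    return sum(pp * ap for pp, ap in zip(passage_probs, answer_probs))

# Toy numbers for two retrieved passages.
print(marginalized_answer_prob([0.6, 0.4], [0.9, 0.1]))  # ≈ 0.58
```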
no code implementations • 12 Jun 2024 • Xueyuan Chen, Dongchao Yang, Dingdong Wang, Xixin Wu, Zhiyong Wu, Helen Meng
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
1 code implementation • 27 Feb 2024 • Hao Ma, Zhiyuan Peng, Xu Li, Mingjie Shao, Xixin Wu, Ju Liu
Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings.
Ranked #1 on Target Sound Extraction on AudioSet
no code implementations • 26 Jan 2024 • Yuejiao Wang, Xixin Wu, Disong Wang, Lingwei Meng, Helen Meng
Dysarthric speech reconstruction (DSR) systems aim to automatically convert dysarthric speech into normal-sounding speech.
1 code implementation • 8 Jan 2024 • Jiawen Kang, Lingwei Meng, Mingyu Cui, Haohan Guo, Xixin Wu, Xunying Liu, Helen Meng
To the best of our knowledge, this work represents an early effort to integrate SIMO and SISO for multi-talker speech recognition.
no code implementations • 19 Dec 2023 • Xueyuan Chen, Xi Wang, Shaofei Zhang, Lei He, Zhiyong Wu, Xixin Wu, Helen Meng
Both objective and subjective evaluations demonstrate that our proposed method can effectively improve the naturalness and expressiveness of the synthesized speech in audiobook synthesis especially for the role and out-of-domain scenarios.
no code implementations • 19 Dec 2023 • Boshi Tang, Zhiyong Wu, Xixin Wu, Qiaochu Huang, Jun Chen, Shun Lei, Helen Meng
A novel calibration framework, named SimCalib, is accordingly proposed to consider similarity between nodes at global and local levels.
no code implementations • 27 Nov 2023 • Xiaohan Feng, Xixin Wu, Helen Meng
This correlation facilitates a comprehensive understanding of the linguistic features influencing the DST model's decision-making process.
1 code implementation • 19 Sep 2023 • Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, James Glass
How can we perform computations over natural language representations to solve tasks that require symbolic and numeric reasoning?
no code implementations • 31 Aug 2023 • Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu, Helen Meng
For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) plays an important role in producing natural and intelligible speech.
1 code implementation • 31 Aug 2023 • Haohan Guo, Fenglong Xie, Jiawen Kang, Yujia Xiao, Xixin Wu, Helen Meng
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower supervised data requirements via Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL) utilizing more unlabeled speech audio.
no code implementations • 29 Aug 2023 • Jingyan Zhou, Minda Hu, Junan Li, Xiaoying Zhang, Xixin Wu, Irwin King, Helen Meng
Making moral judgments is an essential step toward developing ethical AI systems.
no code implementations • 25 May 2023 • Lingwei Meng, Jiawen Kang, Mingyu Cui, Haibin Wu, Xixin Wu, Helen Meng
Extending on this, we incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.
Automatic Speech Recognition (ASR) (+1 more)
no code implementations • 24 May 2023 • Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen Meng, James Glass
Large language models (LLMs) have been significantly improved by instruction fine-tuning, but still lack transparency and the ability to utilize up-to-date knowledge and information.
1 code implementation • 7 Apr 2023 • Tianhua Zhang, Hongyin Luo, Yung-Sung Chuang, Wei Fang, Luc Gaitskell, Thomas Hartvigsen, Xixin Wu, Danny Fox, Helen Meng, James Glass
Despite recent concerns about undesirable behaviors generated by large language models (LLMs), including non-factual, biased, and hateful language, we find LLMs are inherent multi-task language checkers based on their latent representations of natural and social knowledge.
1 code implementation • 14 Mar 2023 • Jinchao Li, Xixin Wu, Kaitao Song, Dongsheng Li, Xunying Liu, Helen Meng
Experimental results based on the ACII Challenge 2022 dataset demonstrate the superior performance of the proposed system and the effectiveness of considering multiple relationships using hierarchical regression chain models.
Ranked #1 on Vocal Bursts Intensity Prediction on HUME-VB
no code implementations • 14 Mar 2023 • Jinchao Li, Kaitao Song, Junan Li, Bo Zheng, Dongsheng Li, Xixin Wu, Xunying Liu, Helen Meng
This paper presents several efficient methods to extract better AD-related cues from high-level acoustic and linguistic features.
1 code implementation • 20 Feb 2023 • Lingwei Meng, Jiawen Kang, Mingyu Cui, Yuejiao Wang, Xixin Wu, Helen Meng
Although automatic speech recognition (ASR) can perform well in common non-overlapping environments, sustaining performance in multi-talker overlapping speech recognition remains challenging.
Automatic Speech Recognition (ASR) (+1 more)
no code implementations • 2 Feb 2023 • Holam Chung, Junan Li, Pengfei Liu, Wai-Kim Leung, Xixin Wu, Helen Meng
Homophone characters are common in tonal syllable-based languages, such as Mandarin and Cantonese.
Automatic Speech Recognition (ASR) (+3 more)
1 code implementation • 27 Oct 2022 • Haohan Guo, Fenglong Xie, Xixin Wu, Hui Lu, Helen Meng
Moreover, we optimize the training strategy by leveraging more audio to learn MSMCRs better for low-resource languages.
no code implementations • 25 Oct 2022 • Hui Lu, Disong Wang, Xixin Wu, Zhiyong Wu, Xunying Liu, Helen Meng
We propose an unsupervised learning method to disentangle speech into content representation and speaker identity representation.
1 code implementation • 22 Sep 2022 • Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng
A vector-quantized variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by down-sampling progressively in multiple stages into MSMC Representations (MSMCRs) with different time resolutions, and quantizing them with multiple VQ codebooks, respectively.
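The core quantization step in such a VQ-VAE analyzer maps each frame to its nearest codebook entry; a minimal single-stage sketch (the codebook and frames are toy values, and the multi-stage down-sampling is omitted):

```python
import numpy as np

def quantize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame to its nearest codebook entry (Euclidean distance),
    as in one stage of a VQ feature analyzer."""
    # Pairwise squared distances, shape (n_frames, n_codes).
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d.argmin(axis=1)]

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2]])
print(quantize(frames, codebook))  # [[0. 0.] [1. 1.]]
```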
no code implementations • 28 Jun 2022 • Yi Wang, Tianzi Wang, Zi Ye, Lingwei Meng, Shoukang Hu, Xixin Wu, Xunying Liu, Helen Meng
Early diagnosis of Alzheimer's disease (AD) is crucial for facilitating preventive care and delaying progression.
no code implementations • 18 Jun 2022 • Haibin Wu, Jiawen Kang, Lingwei Meng, Yang Zhang, Xixin Wu, Zhiyong Wu, Hung-Yi Lee, Helen Meng
However, previous works show that state-of-the-art ASV models are seriously vulnerable to voice spoofing attacks, and the recently proposed high-performance spoofing countermeasure (CM) models focus solely on the standalone anti-spoofing task, ignoring the subsequent speaker verification process.
no code implementations • 31 Mar 2022 • Xixin Wu, Shoukang Hu, Zhiyong Wu, Xunying Liu, Helen Meng
Deep neural networks have brought significant advancements to speech emotion recognition (SER).
no code implementations • 29 Mar 2022 • Haibin Wu, Lingwei Meng, Jiawen Kang, Jinchao Li, Xu Li, Xixin Wu, Hung-Yi Lee, Helen Meng
In the second-level fusion, the CM score and ASV scores directly from ASV systems will be concatenated into a prediction block for the final decision.
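The second-level fusion described here can be sketched as concatenating the CM score with the ASV scores and feeding the result to a prediction block; the linear block and its weights below are illustrative assumptions, not the paper's classifier.

```python
def fusion_features(cm_score: float, asv_scores: list) -> list:
    """Second-level fusion: concatenate the countermeasure (CM) score with
    the scores from the ASV systems to form the prediction block's input."""
    return [cm_score] + list(asv_scores)

def fused_decision(features: list, weights: list, bias: float = 0.0) -> float:
    """Toy linear prediction block over the fused scores (weights are
    hypothetical, for illustration only)."""
    return sum(f * w for f, w in zip(features, weights)) + bias

feats = fusion_features(0.8, [0.3, 0.5])
print(feats)                                    # [0.8, 0.3, 0.5]
print(fused_decision(feats, [1.0, 0.5, 0.5]))   # ≈ 1.2
```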
no code implementations • 8 Mar 2022 • Wen Wu, Chao Zhang, Xixin Wu, Philip C. Woodland
In this paper, a novel Bayesian training loss based on per-utterance Dirichlet prior distributions is proposed for verbal emotion recognition, which models the uncertainty in one-hot labels created when human annotators assign the same utterance to different emotion classes.
no code implementations • 18 Feb 2022 • Disong Wang, Songxiang Liu, Xixin Wu, Hui Lu, Lifa Sun, Xunying Liu, Helen Meng
The primary task of ASA fine-tunes the SE with the speech of the target dysarthric speaker to effectively capture identity-related information, and the secondary task applies adversarial training to avoid the incorporation of abnormal speaking patterns into the reconstructed speech, by regularizing the distribution of reconstructed speech to be close to that of reference speech with high quality.
no code implementations • 4 Feb 2022 • Naijun Zheng, Na Li, Xixin Wu, Lingwei Meng, Jiawen Kang, Haibin Wu, Chao Weng, Dan Su, Helen Meng
This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks.
no code implementations • 8 Nov 2021 • Haibin Wu, Bo Zheng, Xu Li, Xixin Wu, Hung-Yi Lee, Helen Meng
As the paradigm of a self-supervised upstream model followed by downstream tasks attracts increasing attention in the speech community, characterizing the adversarial robustness of this paradigm is a high priority.
2 code implementations • 19 Jul 2021 • Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng
This argument motivates the current work that presents a novel, channel-wise gated Res2Net (CG-Res2Net), which modifies Res2Net to enable a channel-wise gating mechanism in the connection between feature groups.
1 code implementation • 2 Apr 2021 • Qingyun Dou, Yiting Lu, Potsawee Manakul, Xixin Wu, Mark J. F. Gales
This approach guides the model with the generated output history and reference attention, and can reduce the training-inference mismatch without a schedule or a classifier.
no code implementations • 13 Jan 2021 • Xixin Wu, Mark Gales
It is shown that well-calibrated ensemble members will not necessarily yield a well-calibrated ensemble prediction, and if the ensemble prediction is well calibrated, its performance cannot exceed the average performance of the calibrated ensemble members.
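Calibration claims like this one are typically checked with Expected Calibration Error (ECE); a minimal binned-ECE sketch with toy numbers (this is a common metric, not necessarily the paper's exact formulation), illustrating how an averaged ensemble that predicts 0.5 on examples that are 80% positive is miscalibrated:

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Simple binned ECE for binary predictions: group by confidence,
    then compare mean confidence with empirical accuracy in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    n, ece = len(probs), 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(conf - acc)
    return ece

# Averaged "ensemble" confidence 0.5 on examples that are 80% positive:
print(expected_calibration_error([0.5] * 10, [1] * 8 + [0] * 2))  # ≈ 0.3
```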
no code implementations • 3 Nov 2020 • Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng
Third, a conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech, conditioned on the target DSE that is learned via speaker encoder or speaker adaptation.
1 code implementation • 6 Sep 2020 • Songxiang Liu, Yuewen Cao, Disong Wang, Xixin Wu, Xunying Liu, Helen Meng
During the training stage, an encoder-decoder-based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck layer.
no code implementations • 11 Jun 2020 • Xu Li, Na Li, Jinghua Zhong, Xixin Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng
Orthogonal to prior approaches, this work proposes to defend ASV systems against adversarial attacks with a separate detection network, rather than augmenting adversarial data into ASV training.
no code implementations • 8 Apr 2020 • Xu Li, Jinghua Zhong, Jianwei Yu, Shoukang Hu, Xixin Wu, Xunying Liu, Helen Meng
Our experiment results indicate that the DNN x-vector system could benefit from BNNs especially when the mismatch problem is severe for evaluations using out-of-domain data.
no code implementations • 1 Feb 2020 • Xu Li, Xixin Wu, Xunying Liu, Helen Meng
We then explore the non-categories by looking for SPPGs with more than one peak.
2 code implementations • 8 Nov 2019 • Xu Li, Jinghua Zhong, Xixin Wu, Jianwei Yu, Xunying Liu, Helen Meng
Experiment results show that GMM i-vector systems are seriously vulnerable to adversarial attacks, and the crafted adversarial samples prove to be transferable and pose threats to neural-network speaker-embedding based systems (e.g., x-vector systems).
no code implementations • IJCNLP 2019 • Ming Liao, Jing Li, Haisong Zhang, Lingzhi Wang, Xixin Wu, Kam-Fai Wong
Aspect words, indicating opinion targets, are essential in expressing and understanding human opinions.
2 code implementations • 30 Aug 2019 • Peng Liu, Xixin Wu, Shiyin Kang, Guangzhi Li, Dan Su, Dong Yu
End-to-end speech synthesis methods already achieve close-to-human quality performance.