Search Results for author: Xixin Wu

Found 69 papers, 18 papers with code

Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction

no code implementations • 11 Jun 2025 • Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li

Audio-visual target speaker extraction (AV-TSE) models primarily rely on target visual cues to isolate the target speaker's voice from others.

Speech Extraction Target Speaker Extraction

Naturalistic Language-related Movie-Watching fMRI Task for Detecting Neurocognitive Decline and Disorder

no code implementations • 10 Jun 2025 • Yuejiao Wang, Xianmin Gong, Xixin Wu, Patrick Wong, Hoi-lam Helene Fung, Man Wai Mak, Helen Meng

The study demonstrated the potential of the naturalistic language-related fMRI task for early detection of aging-related cognitive decline and NCD.

$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction

no code implementations • 1 Apr 2025 • Wenxuan Wu, Xueyuan Chen, Shuai Wang, Jiadong Wang, Lingwei Meng, Xixin Wu, Helen Meng, Haizhou Li

Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues.

Target Speaker Extraction

Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution

no code implementations • 3 Mar 2025 • Kun Li, Tianhua Zhang, Yunxiang Li, Hongyin Luo, Abdalla Moustafa, Xixin Wu, James Glass, Helen Meng

Improving context faithfulness in large language models is essential for developing trustworthy retrieval augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios involving knowledge conflicts.

counterfactual Domain Adaptation +3

Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data

no code implementations • 19 Jan 2025 • Jingran Xie, Shun Lei, Yue Yu, Yang Xiang, Hui Wang, Xixin Wu, Zhiyong Wu

Empathetic dialogue is crucial for natural human-computer interaction, allowing the dialogue system to respond in a more personalized and emotionally aware manner, improving user satisfaction and engagement.

Dialogue Generation Question Answering

Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT

no code implementations • 2 Jan 2025 • Dongyang Dai, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, Helen Meng

The pre-trained BERT model extracts semantic features from a raw Chinese character sequence and the NN based classifier predicts the polyphonic character's pronunciation according to BERT output.

Polyphone disambiguation Sentence +2

Learning discriminative features from spectrograms using center loss for speech emotion recognition

no code implementations • 2 Jan 2025 • Dongyang Dai, Zhiyong Wu, Runnan Li, Xixin Wu, Jia Jia, Helen Meng

Identifying the emotional state from speech is essential for the natural interaction of the machine with the speaker.

Speech Emotion Recognition

Ontology-grounded Automatic Knowledge Graph Construction by LLM under Wikidata schema

no code implementations • 30 Dec 2024 • Xiaohan Feng, Xixin Wu, Helen Meng

We propose an ontology-grounded approach to Knowledge Graph (KG) construction using Large Language Models (LLMs) on a knowledge base.

graph construction

A Survey on the Honesty of Large Language Models

2 code implementations • 27 Sep 2024 • Siheng Li, Cheng Yang, Taiqiang Wu, Chufan Shi, Yuji Zhang, Xinyu Zhu, Zesen Cheng, Deng Cai, Mo Yu, Lemao Liu, Jie Zhou, Yujiu Yang, Ngai Wong, Xixin Wu, Wai Lam

Honesty is a fundamental principle for aligning large language models (LLMs) with human values, requiring these models to recognize what they know and don't know and be able to faithfully express their knowledge.

Survey

On the Within-class Variation Issue in Alzheimer's Disease Detection

no code implementations • 22 Sep 2024 • Jiawen Kang, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng

Alzheimer's Disease (AD) detection employs machine learning classification models to distinguish between individuals with AD and those without.

Alzheimer's Disease Detection Binary Classification +2

Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC

1 code implementation • 19 Sep 2024 • Jiawen Kang, Lingwei Meng, Mingyu Cui, Yuejiao Wang, Xixin Wu, Xunying Liu, Helen Meng

SACTC is a tailored CTC variant for multi-talker scenarios; it explicitly models speaker disentanglement by constraining the encoder to represent different speakers' tokens at specific time frames.

Disentanglement speech-recognition +1

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

no code implementations • 19 Sep 2024 • Yuanyuan Wang, Hangting Chen, Dongchao Yang, Zhiyong Wu, Xixin Wu

Current Text-to-audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders models from generating audio with fine-grained control of content and style.

Audio Generation

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

no code implementations • 25 Aug 2024 • Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng

With these improvements, we show a significant gain in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.

text-to-speech Text to Speech

Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

no code implementations • 18 Jul 2024 • Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng

Moreover, fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech. Experimental results show that our proposed method significantly outperforms the baseline methods in terms of prosody naturalness and spontaneous behavior naturalness.

Language Modeling Language Modelling +4

Large Language Model-based FMRI Encoding of Language Functions for Subjects with Neurocognitive Disorder

no code implementations • 15 Jul 2024 • Yuejiao Wang, Xianmin Gong, Lingwei Meng, Xixin Wu, Helen Meng

This study highlights the potential of fMRI encoding models and brain scores for detecting early functional changes in NCD patients.

Language Modeling Language Modelling +1

Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

1 code implementation • 13 Jul 2024 • Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng

In this study, we propose a pioneering approach to empower Whisper, which is a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks.

Decoder speech-recognition +1

Purple-teaming LLMs with Adversarial Defender Training

no code implementations • 1 Jul 2024 • Jingyan Zhou, Kun Li, Junan Li, Jiawen Kang, Minda Hu, Xixin Wu, Helen Meng

In PAD, we automatically collect conversational data that cover the vulnerabilities of an LLM around specific safety risks in a self-play manner, where the attacker aims to elicit unsafe responses and the defender generates safe responses to these attacks.

Generative Adversarial Network Red Teaming

Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models

no code implementations • 20 Jun 2024 • Jing Xu, Minglin Wu, Xixin Wu, Helen Meng

Experiments show that our adaptation methods enable mHuBERT to be applied to a new language (Mandarin), with the MOS value increased by about 1.6 and the relative WER reduced by up to 61.72%.

Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers

no code implementations • 16 Jun 2024 • Tianhua Zhang, Kun Li, Hongyin Luo, Xixin Wu, James Glass, Helen Meng

A novel approach is then proposed to assess the retriever's preference for these candidates by the probability of answers conditioned on the conversational query, marginalizing over the Top-$K$ passages.

Conversational Question Answering Passage Retrieval +1

UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization

no code implementations • 26 Jan 2024 • Yuejiao Wang, Xixin Wu, Disong Wang, Lingwei Meng, Helen Meng

Dysarthric speech reconstruction (DSR) systems aim to automatically convert dysarthric speech into normal-sounding speech.

Decoder Domain Adaptation +3

Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

1 code implementation • 8 Jan 2024 • Jiawen Kang, Lingwei Meng, Mingyu Cui, Haohan Guo, Xixin Wu, Xunying Liu, Helen Meng

To the best of our knowledge, this work represents an early effort to integrate SIMO and SISO for multi-talker speech recognition.

Decoder speech-recognition +1

StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

no code implementations • 19 Dec 2023 • Xueyuan Chen, Xi Wang, Shaofei Zhang, Lei He, Zhiyong Wu, Xixin Wu, Helen Meng

Both objective and subjective evaluations demonstrate that our proposed method can effectively improve the naturalness and expressiveness of the synthesized speech in audiobook synthesis, especially for the role and out-of-domain scenarios.

Decoder Speech Synthesis

SimCalib: Graph Neural Network Calibration based on Similarity between Nodes

no code implementations • 19 Dec 2023 • Boshi Tang, Zhiyong Wu, Xixin Wu, Qiaochu Huang, Jun Chen, Shun Lei, Helen Meng

A novel calibration framework, named SimCalib, is accordingly proposed to consider similarity between nodes at global and local levels.

Graph Neural Network

Injecting linguistic knowledge into BERT for Dialogue State Tracking

no code implementations • 27 Nov 2023 • Xiaohan Feng, Xixin Wu, Helen Meng

This correlation facilitates a comprehensive understanding of the linguistic features influencing the DST model's decision-making process.

Decision Making Dialogue State Tracking

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

no code implementations • 31 Aug 2023 • Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu, Helen Meng

For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) plays an important role in producing natural and intelligible speech.

Decoder Multi-Task Learning +2

QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning

1 code implementation • 31 Aug 2023 • Haohan Guo, Fenglong Xie, Jiawen Kang, Yujia Xiao, Xixin Wu, Helen Meng

This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower supervised data requirements via Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL) utilizing more unlabeled speech audio.

Representation Learning Speech Representation Learning +5

Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator

no code implementations • 25 May 2023 • Lingwei Meng, Jiawen Kang, Mingyu Cui, Haibin Wu, Xixin Wu, Helen Meng

Extending on this, we incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

SAIL: Search-Augmented Instruction Learning

no code implementations • 24 May 2023 • Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen Meng, James Glass

Large language models (LLMs) have been significantly improved by instruction fine-tuning, but still lack transparency and the ability to utilize up-to-date knowledge and information.

Denoising Fact Checking +3

Interpretable Unified Language Checking

1 code implementation • 7 Apr 2023 • Tianhua Zhang, Hongyin Luo, Yung-Sung Chuang, Wei Fang, Luc Gaitskell, Thomas Hartvigsen, Xixin Wu, Danny Fox, Helen Meng, James Glass

Despite recent concerns about undesirable behaviors generated by large language models (LLMs), including non-factual, biased, and hateful language, we find LLMs are inherent multi-task language checkers based on their latent representations of natural and social knowledge.

Fact Checking Fairness +2

A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition

1 code implementation • 14 Mar 2023 • Jinchao Li, Xixin Wu, Kaitao Song, Dongsheng Li, Xunying Liu, Helen Meng

Experimental results based on the ACII Challenge 2022 dataset demonstrate the superior performance of the proposed system and the effectiveness of considering multiple relationships using hierarchical regression chain models.

A-VB Culture A-VB High +6

A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One

1 code implementation • 20 Feb 2023 • Lingwei Meng, Jiawen Kang, Mingyu Cui, Yuejiao Wang, Xixin Wu, Helen Meng

Although automatic speech recognition (ASR) can perform well in common non-overlapping environments, sustaining performance in multi-talker overlapping speech recognition remains challenging.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations

1 code implementation • 27 Oct 2022 • Haohan Guo, Fenglong Xie, Xixin Wu, Hui Lu, Helen Meng

Moreover, we optimize the training strategy by leveraging more audio to learn MSMCRs better for low-resource languages.

Transfer Learning

A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS

1 code implementation • 22 Sep 2022 • Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng

A vector-quantized, variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by down-sampling progressively in multiple stages into MSMC Representations (MSMCRs) with different time resolutions, and quantizing them with multiple VQ codebooks, respectively.

Triplet

Tackling Spoofing-Aware Speaker Verification with Multi-Model Fusion

no code implementations • 18 Jun 2022 • Haibin Wu, Jiawen Kang, Lingwei Meng, Yang Zhang, Xixin Wu, Zhiyong Wu, Hung-Yi Lee, Helen Meng

However, previous works show that state-of-the-art ASV models are seriously vulnerable to voice spoofing attacks, and the recently proposed high-performance spoofing countermeasure (CM) models focus solely on standalone anti-spoofing tasks and ignore the subsequent speaker verification process.

Open-Ended Question Answering Speaker Verification

Spoofing-Aware Speaker Verification by Multi-Level Fusion

no code implementations • 29 Mar 2022 • Haibin Wu, Lingwei Meng, Jiawen Kang, Jinchao Li, Xu Li, Xixin Wu, Hung-Yi Lee, Helen Meng

In the second-level fusion, the CM score and ASV scores directly from ASV systems will be concatenated into a prediction block for the final decision.

Speaker Verification

Estimating the Uncertainty in Emotion Class Labels with Utterance-Specific Dirichlet Priors

no code implementations • 8 Mar 2022 • Wen Wu, Chao Zhang, Xixin Wu, Philip C. Woodland

In this paper, a novel Bayesian training loss based on per-utterance Dirichlet prior distributions is proposed for verbal emotion recognition, which models the uncertainty in one-hot labels created when human annotators assign the same utterance to different emotion classes.

Attribute Emotion Classification +1

Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation

no code implementations • 18 Feb 2022 • Disong Wang, Songxiang Liu, Xixin Wu, Hui Lu, Lifa Sun, Xunying Liu, Helen Meng

The primary task of ASA fine-tunes the SE with the speech of the target dysarthric speaker to effectively capture identity-related information, and the secondary task applies adversarial training to avoid the incorporation of abnormal speaking patterns into the reconstructed speech, by regularizing the distribution of reconstructed speech to be close to that of reference speech with high quality.

Multi-Task Learning Speaker Verification

The CUHK-TENCENT speaker diarization system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge

no code implementations • 4 Feb 2022 • Naijun Zheng, Na Li, Xixin Wu, Lingwei Meng, Jiawen Kang, Haibin Wu, Chao Weng, Dan Su, Helen Meng

This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks.

Action Detection Activity Detection +6

Characterizing the adversarial vulnerability of speech self-supervised learning

no code implementations • 8 Nov 2021 • Haibin Wu, Bo Zheng, Xu Li, Xixin Wu, Hung-Yi Lee, Helen Meng

As the paradigm of the self-supervised learning upstream model followed by downstream tasks arouses more attention in the speech community, characterizing the adversarial robustness of such paradigm is of high priority.

Adversarial Robustness Benchmarking +3

Channel-wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks

2 code implementations • 19 Jul 2021 • Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng

This argument motivates the current work that presents a novel, channel-wise gated Res2Net (CG-Res2Net), which modifies Res2Net to enable a channel-wise gating mechanism in the connection between feature groups.

Speaker Verification

Attention Forcing for Machine Translation

1 code implementation • 2 Apr 2021 • Qingyun Dou, Yiting Lu, Potsawee Manakul, Xixin Wu, Mark J. F. Gales

This approach guides the model with the generated output history and reference attention, and can reduce the training-inference mismatch without a schedule or a classifier.

Machine Translation NMT +3

Should Ensemble Members Be Calibrated?

no code implementations • 13 Jan 2021 • Xixin Wu, Mark Gales

It is shown that well calibrated ensemble members will not necessarily yield a well calibrated ensemble prediction, and if the ensemble prediction is well calibrated its performance cannot exceed that of the average performance of the calibrated ensemble members.

image-classification Image Classification +2

Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion

no code implementations • 3 Nov 2020 • Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng

Third, a conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech, conditioned on the target DSE that is learned via speaker encoder or speaker adaptation.

speech-recognition Speech Recognition +1

Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling

1 code implementation • 6 Sep 2020 • Songxiang Liu, Yuewen Cao, Disong Wang, Xixin Wu, Xunying Liu, Helen Meng

During the training stage, an encoder-decoder-based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck layer.

feature selection speech-recognition +2

Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification

no code implementations • 11 Jun 2020 • Xu Li, Na Li, Jinghua Zhong, Xixin Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng

Orthogonal to prior approaches, this work proposes to defend ASV systems against adversarial attacks with a separate detection network, rather than augmenting adversarial data into ASV training.

Binary Classification Data Augmentation +1

Bayesian x-vector: Bayesian Neural Network based x-vector System for Speaker Verification

no code implementations • 8 Apr 2020 • Xu Li, Jinghua Zhong, Jianwei Yu, Shoukang Hu, Xixin Wu, Xunying Liu, Helen Meng

Our experiment results indicate that the DNN x-vector system could benefit from BNNs especially when the mismatch problem is severe for evaluations using out-of-domain data.

Speaker Verification

Deep segmental phonetic posterior-grams based discovery of non-categories in L2 English speech

no code implementations • 1 Feb 2020 • Xu Li, Xixin Wu, Xunying Liu, Helen Meng

And then we explore the non-categories by looking for the SPPGs with more than one peak.

Adversarial Attacks on GMM i-vector based Speaker Verification Systems

2 code implementations • 8 Nov 2019 • Xu Li, Jinghua Zhong, Xixin Wu, Jianwei Yu, Xunying Liu, Helen Meng

Experimental results show that GMM i-vector systems are seriously vulnerable to adversarial attacks, and the crafted adversarial samples prove to be transferable and pose threats to neural network speaker embedding based systems (e.g., x-vector systems).

Speaker Verification

Maximizing Mutual Information for Tacotron

2 code implementations • 30 Aug 2019 • Peng Liu, Xixin Wu, Shiyin Kang, Guangzhi Li, Dan Su, Dong Yu

End-to-end speech synthesis methods already achieve close-to-human quality performance.

Attribute Speech Synthesis
