Search Results for author: Xilin Jiang

Found 13 papers, 8 papers with code

Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue

1 code implementation • 7 Sep 2024 • Junkai Wu, Xulin Fan, Bo-Ru Lu, Xilin Jiang, Nima Mesgarani, Mark Hasegawa-Johnson, Mari Ostendorf

However, after carefully examining Gaokao's questions, we find that the correct answers to many questions can be inferred from the conversation transcript alone, i.e., without speaker segmentation and identification.

Question Answering Speaker Identification

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

no code implementations • 13 Aug 2024 • Yinghao Aaron Li, Xilin Jiang, Jordan Darefsky, Ge Zhu, Nima Mesgarani

While the response speech is being played, the input speech undergoes ASR processing to extract the transcription and speaking style, serving as the context for the ensuing dialogue turn.

Automatic Speech Recognition (ASR) +4

Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis

2 code implementations • 13 Jul 2024 • Xilin Jiang, Yinghao Aaron Li, Adrian Nicolas Florea, Cong Han, Nima Mesgarani

It is too early to conclude that Mamba is a better alternative to transformers for speech without comparing the two in terms of both performance and efficiency across multiple speech-related tasks.

Speech Recognition +2

SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model

1 code implementation • 20 May 2024 • Siavash Shams, Sukru Samet Dindar, Xilin Jiang, Nima Mesgarani

Transformers have revolutionized deep learning across various tasks, including audio representation learning, due to their powerful modeling capabilities.

Ranked #2 on Keyword Spotting on Google Speech Commands V2 35 (using extra training data)

Audio Classification Keyword Spotting +2

Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience

no code implementations • 6 Feb 2024 • Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani

In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume.

Language Modelling Large Language Model

Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation

no code implementations • 27 Sep 2023 • Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani

In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode the 'what' and 'where' of spatial audio.

Contrastive Learning Data Augmentation

HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

1 code implementation • 18 Sep 2023 • Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani

Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance.

Speech Synthesis

DeCoR: Defy Knowledge Forgetting by Predicting Earlier Audio Codes

no code implementations • 29 May 2023 • Xilin Jiang, Yinghao Aaron Li, Nima Mesgarani

Lifelong audio feature extraction involves learning new sound classes incrementally, which is essential for adapting to new data distributions over time.

Acoustic Scene Classification Continual Learning +3

Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

2 code implementations • 20 Jan 2023 • Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani

Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns.

Text to Speech

Compute and memory efficient universal sound source separation

3 code implementations • 3 Mar 2021 • Efthymios Tzinis, Zhepei Wang, Xilin Jiang, Paris Smaragdis

Recent progress in audio source separation led by deep learning has enabled many neural network models to provide robust solutions to this fundamental estimation problem.

Audio Source Separation Efficient Neural Network +1
