no code implementations • 16 Sep 2024 • Yinghao Aaron Li, Xilin Jiang, Cong Han, Nima Mesgarani
Using classifier-free guidance, StyleTTS-ZS achieves high similarity to the reference speaker in the style diffusion process.
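The classifier-free guidance step mentioned here follows the standard recipe of blending conditional and unconditional diffusion predictions; the sketch below is illustrative only (function name and scalar inputs are assumptions, not StyleTTS-ZS code):

```python
def classifier_free_guidance(pred_uncond, pred_cond, w):
    """Standard classifier-free guidance blend (illustrative sketch).

    w = 0 -> purely unconditional prediction
    w = 1 -> purely conditional prediction
    w > 1 -> conditioning amplified, pushing samples closer to the
             reference (here, the reference speaker's style)
    """
    return pred_uncond + w * (pred_cond - pred_uncond)
```

Works unchanged on scalars or array-like predictions, since it is a single affine combination per element.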
1 code implementation • 7 Sep 2024 • Junkai Wu, Xulin Fan, Bo-Ru Lu, Xilin Jiang, Nima Mesgarani, Mark Hasegawa-Johnson, Mari Ostendorf
However, after carefully examining Gaokao's questions, we find that the correct answers to many questions can be inferred from the conversation transcript alone, i.e., without speaker segmentation and identification.
no code implementations • 13 Aug 2024 • Yinghao Aaron Li, Xilin Jiang, Jordan Darefsky, Ge Zhu, Nima Mesgarani
While the response speech is being played, the input speech undergoes ASR processing to extract the transcription and speaking style, serving as the context for the ensuing dialogue turn.
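The turn-taking loop described above can be sketched as follows; every function name here is a hypothetical stand-in for the paper's ASR, response-generation, and playback components, and the real system overlaps playback with recognition rather than running them strictly in sequence:

```python
def dialogue_turn(play, asr, generate, context, user_audio):
    """One turn of the described loop (all callables are illustrative stubs).

    The user's input speech is transcribed and its speaking style estimated;
    both are appended to the running context that conditions the next turn.
    """
    text, style = asr(user_audio)        # transcription + speaking style
    context = context + [(text, style)]  # carry over as dialogue context
    response = generate(context)         # next response conditioned on context
    play(response)                       # playback (overlaps ASR in practice)
    return context
```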
Automatic Speech Recognition (ASR) +4
2 code implementations • 13 Jul 2024 • Xilin Jiang, Yinghao Aaron Li, Adrian Nicolas Florea, Cong Han, Nima Mesgarani
It is too early to conclude that Mamba is a better alternative to transformers for speech without comparing the two in terms of both performance and efficiency across multiple speech-related tasks.
1 code implementation • 20 May 2024 • Siavash Shams, Sukru Samet Dindar, Xilin Jiang, Nima Mesgarani
Transformers have revolutionized deep learning across various tasks, including audio representation learning, due to their powerful modeling capabilities.
Ranked #2 on Keyword Spotting on Google Speech Commands V2 35 (using extra training data)
1 code implementation • 27 Mar 2024 • Xilin Jiang, Cong Han, Nima Mesgarani
In this work, we replace transformers with Mamba, a selective state space model, for speech separation.
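At its core, Mamba is a selective state space model: a linear recurrence whose step size depends on the input. The toy 1-D scan below is a generic sketch of that idea only; it omits the gating, input/output projections, and hardware-aware kernel of the actual model, and all names are illustrative:

```python
import numpy as np

def selective_scan(u, delta, A, B, C):
    """Minimal 1-D selective state space scan (illustrative, not Mamba's kernel).

    u:     (T,) input sequence
    delta: (T,) input-dependent step sizes (the 'selective' ingredient)
    A, B, C: scalars of the continuous system dx/dt = A x + B u, y = C x
    """
    x, ys = 0.0, []
    for t in range(len(u)):
        # Zero-order-hold discretization with per-step delta[t]
        a_bar = np.exp(delta[t] * A)
        b_bar = delta[t] * B
        x = a_bar * x + b_bar * u[t]
        ys.append(C * x)
    return np.array(ys)
```

With A < 0 the state decays between inputs, so an impulse produces a geometrically fading response; making `delta` input-dependent is what lets the model choose how much of each input to retain.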
no code implementations • 6 Feb 2024 • Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani
In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume.
no code implementations • 27 Sep 2023 • Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani
In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode 'what' and 'where' of spatial audio.
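The SimCLR-style objective underlying such a framework can be sketched as an InfoNCE loss over paired embeddings; the code below is a generic sketch and does not reproduce MC-SimCLR's multi-channel encoder or its spatial ('where') targets:

```python
import numpy as np

def info_nce_loss(z1, z2, tau=0.1):
    """SimCLR-style contrastive loss for two views of the same N items.

    z1, z2: (N, D) embedding arrays; row i of z1 and row i of z2 are a
    positive pair, all other rows act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                     # (N, N) similarity logits
    # Positives sit on the diagonal; the loss is cross-entropy against them.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(sim)))
```

When matched pairs are far more similar to each other than to the negatives, the loss approaches zero; shuffling the pairing drives it up, which is the signal the encoder is trained on.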
1 code implementation • 18 Sep 2023 • Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani
Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance.
no code implementations • 29 May 2023 • Xilin Jiang, Yinghao Aaron Li, Nima Mesgarani
Lifelong audio feature extraction involves learning new sound classes incrementally, which is essential for adapting to new data distributions over time.
2 code implementations • 20 Jan 2023 • Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani
Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns.
1 code implementation • 15 May 2022 • Zhepei Wang, Cem Subakan, Xilin Jiang, Junkai Wu, Efthymios Tzinis, Mirco Ravanelli, Paris Smaragdis
In this paper, we work on a sound recognition system that continually incorporates new sound classes.
3 code implementations • 3 Mar 2021 • Efthymios Tzinis, Zhepei Wang, Xilin Jiang, Paris Smaragdis
Recent progress in audio source separation, led by deep learning, has enabled many neural network models to provide robust solutions to this fundamental estimation problem.
Ranked #5 on Speech Separation on WHAMR!