1 code implementation • 6 Aug 2024 • Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux
This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution.
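The core idea of local modeling by convolution can be sketched as a depthwise 1-D convolution applied alongside a Transformer block's global attention. The function below is a minimal illustrative sketch, not the authors' TF-Locoformer implementation; the name `local_conv_module` and the shapes are assumptions.

```python
import numpy as np

def local_conv_module(x, kernel):
    """Depthwise 1-D convolution along the time axis with a residual add.

    x: (time, channels) feature map; kernel: (k, channels) per-channel taps.
    Hypothetical sketch of local modeling by convolution inside a
    Transformer block; names and shapes are illustrative only.
    """
    k, c = kernel.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad time axis, "same" output
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        # each channel is filtered independently (depthwise)
        out[t] = np.sum(xp[t:t + k] * kernel, axis=0)
    return x + out  # residual connection, as is typical in such blocks
```

Because the convolution is depthwise, it adds only local temporal context per channel, complementing the attention layers' global modeling.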
1 code implementation • 6 Aug 2024 • Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux
Separation performance is also boosted by adding a novel loss term where separated signals mapped back to their own input mixture are used as pseudo-targets for the signals separated from other channels and mapped to the same channel.
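The pseudo-target idea can be sketched as follows: the signal separated from channel i and mapped back to channel i acts as a reference for the signals separated from the other channels and mapped to that same channel. This is a hedged MSE sketch under assumed data layout; the paper's actual loss formulation may differ.

```python
import numpy as np

def cross_channel_pseudo_target_loss(remixed):
    """remixed[i][j]: signal separated from channel j, then mapped to channel i.

    The own-channel estimate remixed[i][i] serves as a pseudo-target for
    every cross-channel estimate remixed[i][j], j != i. An illustrative
    mean-squared-error version; the paper's exact loss may differ.
    """
    loss = 0.0
    n = len(remixed)
    for i in range(n):
        pseudo = remixed[i][i]  # own-channel estimate: pseudo-target
        for j in range(n):
            if j != i:
                loss += np.mean((remixed[i][j] - pseudo) ** 2)
    return loss
```

The loss vanishes when all estimates mapped to a given channel agree, which is the consistency the pseudo-targets encourage.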
1 code implementation • 27 Feb 2024 • Yoshiki Masuyama, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux
Existing NF-based methods have focused on estimating the magnitude of the HRTF for a given sound source direction; the magnitude is then converted to a finite impulse response (FIR) filter.
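One generic way to convert a magnitude response to an FIR filter is a zero-phase inverse FFT followed by centering and truncation, yielding a linear-phase filter. This is a standard DSP construction, not necessarily the conversion used in the cited work (which might, for example, use a minimum-phase design instead).

```python
import numpy as np

def magnitude_to_fir(mag, n_taps):
    """Convert a one-sided magnitude response to a linear-phase FIR filter.

    mag: magnitudes at n_fft // 2 + 1 uniformly spaced frequencies.
    A generic zero-phase IFFT + circular-shift construction; the cited
    work may use a different (e.g. minimum-phase) conversion.
    """
    # Assume zero phase: the real inverse FFT of the full spectrum
    h = np.fft.irfft(mag.astype(float))
    # Circularly shift to center the impulse response, then truncate
    h = np.roll(h, len(h) // 2)
    mid = len(h) // 2
    return h[mid - n_taps // 2: mid + (n_taps + 1) // 2]
```

For a flat magnitude response this reduces to a (delayed) unit impulse, a quick sanity check on the construction.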
no code implementations • 12 Dec 2023 • Zexu Pan, Gordon Wichern, Francois G. Germain, Sameer Khurana, Jonathan Le Roux
Neuro-steered speaker extraction aims to extract the listener's brain-attended speech signal from a multi-talker speech signal, in which the attention is derived from the cortical activity.
no code implementations • 30 Oct 2023 • Zexu Pan, Gordon Wichern, Yoshiki Masuyama, Francois G. Germain, Sameer Khurana, Chiori Hori, Jonathan Le Roux
Target speech extraction aims to extract, based on a given conditioning cue, a target speech signal that is corrupted by interfering sources, such as noise or competing speakers.
no code implementations • 16 Oct 2023 • Dimitrios Bralios, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux
The introduction of audio latent diffusion models possessing the ability to generate realistic sound clips on demand from a text description has the potential to revolutionize how we work with audio.
no code implementations • 16 Oct 2023 • Yu Chen, Xinyuan Qian, Zexu Pan, Kainan Chen, Haizhou Li
The prevailing noise-resistant and reverberation-resistant localization algorithms primarily emphasize separating and providing a directional output for each speaker in multi-speaker scenarios, without associating those outputs with speaker identities.
no code implementations • 26 Jul 2023 • Zexu Pan, Marvin Borsdorf, Siqi Cai, Tanja Schultz, Haizhou Li
We propose both an offline and an online NeuroHeed, with the latter designed for real-time inference.
1 code implementation • 22 May 2023 • Yidi Jiang, Ruijie Tao, Zexu Pan, Haizhou Li
To benefit from both facial cue and reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking.
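The cue fusion described here can be caricatured as combining an audio-visual synchronization feature with a pre-enrolled speaker embedding before scoring whether the target is speaking. The sketch below uses a simple concatenation and linear projection; TS-TalkNet's actual architecture is a learned attention-based network, and the function and parameter names are hypothetical.

```python
import numpy as np

def fuse_cues(av_sync_feat, spk_embedding, w):
    """Fuse an audio-visual sync feature with a speaker embedding and
    project to a speaking/not-speaking probability.

    Purely illustrative of cue fusion; not the TS-TalkNet architecture.
    w: projection weights of length len(av_sync_feat) + len(spk_embedding).
    """
    fused = np.concatenate([av_sync_feat, spk_embedding])
    logit = float(w @ fused)
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid probability
```

The point of the design is that the speaker embedding supplies identity evidence even when lip-audio synchronization is ambiguous.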
no code implementations • 2 Nov 2022 • Zexu Pan, Gordon Wichern, François G. Germain, Aswin Subramanian, Jonathan Le Roux
Speaker diarization is well studied for constrained audios but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers.
1 code implementation • 31 Oct 2022 • Zexu Pan, Wupeng Wang, Marvin Borsdorf, Haizhou Li
In this paper, we study audio-visual speaker extraction algorithms with intermittent visual cues.
no code implementations • 9 Oct 2022 • Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
1 code implementation • 31 Mar 2022 • Zexu Pan, Meng Ge, Haizhou Li
We propose a hybrid continuity loss function for time-domain speaker extraction algorithms to address the over-suppression problem.
Automatic Speech Recognition (ASR) +2
1 code implementation • 31 Mar 2022 • Zexu Pan, Xinyuan Qian, Haizhou Li
Speaker extraction seeks to extract the clean speech of a target speaker from a multi-talker speech mixture.
1 code implementation • 30 Sep 2021 • Zexu Pan, Meng Ge, Haizhou Li
A speaker extraction algorithm requires an auxiliary reference, such as a video recording or a pre-recorded speech sample, to form top-down auditory attention on the target speaker.
4 code implementations • 14 Jul 2021 • Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers.
Active Speaker Detection • Audio-Visual Active Speaker Detection
1 code implementation • 14 Jun 2021 • Zexu Pan, Ruijie Tao, Chenglin Xu, Haizhou Li
A speaker extraction algorithm seeks to extract the speech of a target speaker from a multi-talker speech mixture when given a cue that represents the target speaker, such as a pre-enrolled speech utterance, or an accompanying video track.
1 code implementation • The ActivityNet Large-Scale Activity Recognition Challenge Workshop, CVPR 2021 • Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers.
Active Speaker Detection • Audio-Visual Active Speaker Detection
1 code implementation • 15 Oct 2020 • Zexu Pan, Ruijie Tao, Chenglin Xu, Haizhou Li
A speaker extraction algorithm relies on a speech sample from the target speaker as a reference to focus its attention.