Target Speaker Extraction

9 papers with code • 0 benchmarks • 0 datasets

Extract the speech of a specified target speaker from a multi-speaker conversation.
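As a rough illustration of the task, the sketch below extracts a "target" from a two-source mixture using a short reference (enrollment) signal as the cue. It is a toy under strong assumptions, not the method of any paper listed here: the "speakers" are pure sinusoids, and the hypothetical helper `extract_target` simply keeps the frequency bins where the enrollment signal has most of its energy. Real systems replace this fixed spectral mask with a learned network conditioned on a speaker embedding or a visual cue.

```python
import numpy as np

def extract_target(mixture, enrollment, keep_ratio=0.1):
    """Toy extraction: keep mixture bins where the enrollment spectrum is strong.

    `keep_ratio` is an illustrative threshold (fraction of the enrollment's
    peak magnitude), not a parameter taken from any listed paper.
    """
    n = len(mixture)
    mix_spec = np.fft.rfft(mixture)
    # Zero-pad the enrollment to the mixture length so frequency bins line up.
    enr_mag = np.abs(np.fft.rfft(enrollment, n=n))
    mask = (enr_mag >= keep_ratio * enr_mag.max()).astype(float)
    return np.fft.irfft(mix_spec * mask, n=n)

def corr(a, b):
    """Normalized correlation between two equal-length signals."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sr = 8000                                   # assumed sample rate in Hz
t = np.arange(sr) / sr                      # one second of samples
target = np.sin(2 * np.pi * 220 * t)        # stand-in for the target speaker
interferer = np.sin(2 * np.pi * 1000 * t)   # stand-in for a competing speaker
mixture = target + interferer

# The cue: a shorter recording of the target, with a different phase.
enrollment = np.sin(2 * np.pi * 220 * t[: sr // 2] + 0.3)

estimate = extract_target(mixture, enrollment)
print("similarity to target:    ", round(corr(estimate, target), 3))
print("similarity to interferer:", round(corr(estimate, interferer), 3))
```

The estimate should correlate strongly with the target and hardly at all with the interferer. This only works because the two toy sources occupy disjoint frequency bands; overlapping real speech is what motivates the learned, cue-conditioned models listed below.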

Most implemented papers

GPU-accelerated Guided Source Separation for Meeting Transcription

desh2608/gss 10 Dec 2022

In this paper, we describe our improved implementation of GSS that leverages the power of modern GPU-based pipelines, including batched processing of frequencies and segments, to provide a 300x speed-up over CPU-based inference.

Muse: Multi-modal target speaker extraction with visual cues

lin9x/av-sepformer 15 Oct 2020

Speaker extraction algorithms rely on a speech sample from the target speaker as the reference point to focus their attention.

Target Speaker Verification with Selective Auditory Attention for Single and Multi-talker Speech

xuchenglin28/speaker_extraction 30 Mar 2021

Inspired by the study on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that is able to pay selective auditory attention to the target speaker.

Selective Listening by Synchronizing Speech with Lips

zexupan/reentry 14 Jun 2021

A speaker extraction algorithm seeks to extract the speech of a target speaker from a multi-talker speech mixture when given a cue that represents the target speaker, such as a pre-enrolled speech utterance, or an accompanying video track.

L-SpEx: Localized Target Speaker Extraction

gemengtju/l-spex 21 Feb 2022

Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance.

A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction

zexupan/avse_hybrid_loss 31 Mar 2022

We propose a hybrid continuity loss function for time-domain speaker extraction algorithms to settle the over-suppression problem.

ImagineNET: Target Speaker Extraction with Intermittent Visual Cue through Embedding Inpainting

zexupan/imaginenet 31 Oct 2022

In this paper, we study audio-visual speaker extraction algorithms with an intermittent visual cue.

RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation

spkgyk/RTFS-Net 29 Sep 2023

This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.

Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

haoxiangsnr/llm-tse 11 Oct 2023

However, the effectiveness of these models is hindered in real-world scenarios by unreliable or even absent pre-registered cues.