no code implementations • 16 Jan 2024 • Yang Yang, George Sung, Shao-Fu Shih, Hakan Erdogan, Chehung Lee, Matthias Grundmann
We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones.
no code implementations • 21 Aug 2023 • Hakan Erdogan, Scott Wisdom, Xuankai Chang, Zalán Borsos, Marco Tagliasacchi, Neil Zeghidour, John R. Hershey
The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs.
Automatic Speech Recognition (ASR) +3
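A minimal sketch of the input-masking idea described above, assuming a joint sequence of transcript and audio tokens and a hypothetical `MASK` sentinel; the actual vocabulary and task set are not specified in the abstract:

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for masked positions

def build_example(transcript_ids, audio_ids, task):
    """Select a task by masking one modality of the joint input.

    Masking the transcript yields an ASR-like task (predict text from
    audio); masking the audio tokens yields a synthesis-like task.
    """
    transcript = np.asarray(transcript_ids)
    audio = np.asarray(audio_ids)
    if task == "asr":            # hide the text, keep the audio
        inputs = np.concatenate([np.full_like(transcript, MASK), audio])
    elif task == "synthesis":    # hide the audio, keep the text
        inputs = np.concatenate([transcript, np.full_like(audio, MASK)])
    else:
        raise ValueError(task)
    targets = np.concatenate([transcript, audio])  # unmasked originals
    return inputs, targets
```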
no code implementations • 13 Mar 2023 • Yang Yang, Shao-Fu Shih, Hakan Erdogan, Jamie Menjay Lin, Chehung Lee, Yunpeng Li, George Sung, Matthias Grundmann
The multi-microphone speech enhancement problem is often decomposed into two decoupled steps: a beamformer that provides spatial filtering and a single-channel speech enhancement model that cleans up the beamformer output.
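As a rough illustration of that two-step decomposition (a sketch, not the paper's model), a toy delay-and-sum beamformer followed by an arbitrary single-channel enhancement callable:

```python
import numpy as np

def delay_and_sum(mics, delays):
    """Toy spatial filter: align each channel by an integer sample
    delay and average. `mics` is (num_mics, num_samples)."""
    aligned = [np.roll(ch, -d) for ch, d in zip(mics, delays)]
    return np.mean(aligned, axis=0)

def two_step_enhance(mics, delays, enhancement_model):
    """Decoupled pipeline: spatial filtering first, then a
    single-channel model (any callable) cleans up the output."""
    return enhancement_model(delay_and_sum(mics, delays))
```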
no code implementations • 29 Mar 2022 • Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey
Typically, neural network-based speech dereverberation models are trained on paired data, composed of a dry utterance and its corresponding reverberant utterance.
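That paired-data recipe can be sketched as convolving a dry utterance with a room impulse response (RIR); details such as truncation to the original length are our assumptions:

```python
import numpy as np

def make_training_pair(dry, rir):
    """Build a (dry, reverberant) pair: the reverberant utterance is
    the dry signal convolved with a room impulse response."""
    wet = np.convolve(dry, rir)[: len(dry)]  # truncate to input length
    return dry, wet
```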
no code implementations • 30 Jun 2021 • Yuma Koizumi, Shigeki Karita, Scott Wisdom, Hakan Erdogan, John R. Hershey, Llion Jones, Michiel Bacchiani
To make the model computationally feasible, we extend the Conformer using linear complexity attention and stacked 1-D dilated depthwise convolution layers.
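A sketch of the stacked 1-D dilated depthwise convolutions mentioned above (in PyTorch; channel count, kernel size, and depth are illustrative guesses, and the linear-attention component is omitted):

```python
import torch
import torch.nn as nn

class DilatedDepthwiseStack(nn.Module):
    """Stack of 1-D depthwise convolutions with exponentially growing
    dilation, so the receptive field grows without extra parameters."""

    def __init__(self, channels=256, kernel_size=3, num_layers=4):
        super().__init__()
        layers = []
        for i in range(num_layers):
            dilation = 2 ** i
            layers += [
                nn.Conv1d(
                    channels, channels, kernel_size,
                    dilation=dilation,
                    padding=dilation * (kernel_size - 1) // 2,
                    groups=channels,  # depthwise: one filter per channel
                ),
                nn.ReLU(),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, channels, time)
        return self.net(x)
```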
no code implementations • 1 Jun 2021 • Scott Wisdom, Aren Jansen, Ron J. Weiss, Hakan Erdogan, John R. Hershey
The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation.
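For reference, mixture invariant training (MixIT) remixes the estimated sources and scores every assignment against the two input mixtures. The brute-force version below is exponential in the number of sources, unlike the efficient loss the abstract describes, and the SNR loss is a common choice rather than a detail from the paper:

```python
import itertools
import numpy as np

def neg_snr(est, ref, eps=1e-8):
    """Negative SNR in dB (lower is better)."""
    err = ref - est
    return -10.0 * np.log10(np.sum(ref ** 2) / (np.sum(err ** 2) + eps) + eps)

def mixit_loss(est_sources, mix1, mix2):
    """Brute-force MixIT: assign each of the M estimated sources to one
    of the two reference mixtures, and keep the best total loss."""
    M = est_sources.shape[0]
    best = np.inf
    for assign in itertools.product([0, 1], repeat=M):
        s1 = sum((est_sources[i] for i in range(M) if assign[i] == 0),
                 np.zeros_like(mix1))
        s2 = sum((est_sources[i] for i in range(M) if assign[i] == 1),
                 np.zeros_like(mix2))
        best = min(best, neg_snr(s1, mix1) + neg_snr(s2, mix2))
    return best
```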
no code implementations • 5 May 2021 • Soumi Maiti, Hakan Erdogan, Kevin Wilson, Scott Wisdom, Shinji Watanabe, John R. Hershey
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
no code implementations • 17 Dec 2020 • Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen
Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years.
no code implementations • 3 Nov 2020 • Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey
Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +4
no code implementations • 2 Nov 2020 • Scott Wisdom, Hakan Erdogan, Daniel Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, John Hershey
We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types.
no code implementations • NeurIPS 2020 • Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron J. Weiss, Kevin Wilson, John R. Hershey
In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources.
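That supervised recipe is easy to sketch: sum isolated sources to form the input, keep the sources as targets, and (commonly, though the abstract does not spell this out) train with a permutation invariant loss:

```python
import itertools
import numpy as np

def synthetic_example(sources):
    """Input is the sum of isolated ground-truth sources; the targets
    are the sources themselves. `sources` is (num_sources, samples)."""
    return np.sum(sources, axis=0), sources

def pit_loss(estimates, references, loss_fn):
    """Permutation invariant training: evaluate every pairing of
    estimates to references and keep the best-scoring one."""
    n = len(references)
    return min(
        sum(loss_fn(estimates[p[i]], references[i]) for i in range(n))
        for p in itertools.permutations(range(n))
    )
```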
no code implementations • 18 Nov 2019 • Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, Kevin Wilson, Desh Raj, Shinji Watanabe, Zhuo Chen, John R. Hershey
This work introduces sequential neural beamforming, which alternates between neural-network-based spectral separation and beamforming-based spatial separation.
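One common form of the spatial step (the specifics here are our assumption; mask-based MVDR is standard in this line of work) estimates spatial covariances from the spectral network's masks and solves for beamformer weights per frequency:

```python
import numpy as np

def mask_based_mvdr(stft, speech_mask, noise_mask, eps=1e-8):
    """MVDR beamforming from time-frequency masks.

    stft: (mics, frames, freqs) complex spectrogram.
    speech_mask, noise_mask: (frames, freqs) values in [0, 1].
    Steering vector = principal eigenvector of the speech covariance.
    """
    mics, frames, freqs = stft.shape
    out = np.zeros((frames, freqs), dtype=complex)
    for f in range(freqs):
        X = stft[:, :, f]                                   # (mics, frames)
        phi_s = (speech_mask[:, f] * X) @ X.conj().T / frames
        phi_n = (noise_mask[:, f] * X) @ X.conj().T / frames
        phi_n += eps * np.eye(mics)                         # regularize
        d = np.linalg.eigh(phi_s)[1][:, -1]                 # steering vector
        w = np.linalg.solve(phi_n, d)
        w /= (d.conj() @ w) + eps                           # MVDR weights
        out[:, f] = w.conj() @ X
    return out
```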
no code implementations • 8 May 2019 • Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, John R. Hershey
For learnable bases, shorter windows (2.5 ms) work best on all tasks.
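In the Conv-TasNet style that "learnable bases" usually refers to (our reading of the abstract), the basis is a strided 1-D convolution whose kernel length sets the window; at 16 kHz, a 2.5 ms window is 40 samples:

```python
import torch
import torch.nn as nn

sample_rate = 16000
win = int(0.0025 * sample_rate)   # 2.5 ms window = 40 samples
encoder = nn.Conv1d(1, 256, kernel_size=win, stride=win // 2, bias=False)

x = torch.randn(1, 1, sample_rate)   # one second of audio
features = encoder(x)                # (1, 256, frames)
```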
no code implementations • 13 Apr 2019 • Takuya Yoshioka, Zhuo Chen, Changliang Liu, Xiong Xiao, Hakan Erdogan, Dimitrios Dimitriadis
Speaker-independent continuous speech separation (SI-CSS) is the task of converting a continuous audio stream, which may contain overlapping voices of unknown speakers, into a fixed number of continuous signals, each of which contains no overlapping speech.
1 code implementation • 6 Nov 2018 • Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, John R. Hershey
In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality.
Sound • Audio and Speech Processing
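The scale-invariant variant (SI-SDR) that this paper analyzes projects the estimate onto the reference before comparing signal to residual; a minimal NumPy version, with zero-mean normalization as commonly done:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB between an estimate and a reference."""
    ref = ref - ref.mean()
    est = est - est.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)  # optimal scale
    target = alpha * ref
    residual = est - target
    return 10.0 * np.log10(
        np.dot(target, target) / (np.dot(residual, residual) + eps) + eps
    )
```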
no code implementations • 8 Oct 2018 • Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, Xiong Xiao, Fil Alleva
The goal of this work is to develop a meeting transcription system that can recognize speech even when utterances of different speakers are overlapped.
no code implementations • 21 Nov 2017 • Zhong Meng, Shinji Watanabe, John R. Hershey, Hakan Erdogan
Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients.
no code implementations • 12 Nov 2013 • Emad M. Grais, Mehmet Umut Sen, Hakan Erdogan
In the training stage, the training data for the source signals are used to train a DNN.
no code implementations • LREC 2012 • Ibrahim Saygin Topkaya, Hakan Erdogan
The main aim of collecting the SUTAV database was to obtain a large audio-visual collection of spoken words, numbers and sentences in the Turkish language.
no code implementations • 29 Jul 2011 • Ibrahim Saygin Topkaya, Hakan Erdogan
We use two different classifiers, one of which comes from a background modeling method, to generate the weight image, and we dynamically compute each classifier's contribution from its confidence to produce the final weight image used in tracking.
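A minimal sketch of that dynamic combination, assuming scalar confidences and per-pixel weight images; the paper's exact fusion rule may differ:

```python
import numpy as np

def fuse_weight_images(w1, w2, conf1, conf2, eps=1e-8):
    """Blend two classifiers' weight images, weighting each by its
    confidence, to produce the final weight image used in tracking."""
    a = conf1 / (conf1 + conf2 + eps)
    return a * w1 + (1.0 - a) * w2
```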