Search Results for author: Shinji Watanabe

Found 279 papers, 90 papers with code

Deep clustering: Discriminative embeddings for segmentation and separation

8 code implementations • 18 Aug 2015 • John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe

The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources.

Ranked #30 on Speech Separation on WSJ0-2mix

Clustering Deep Clustering +3

2,107

Single-Channel Multi-Speaker Separation using Deep Clustering

2 code implementations • 7 Jul 2016 • Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, John R. Hershey

In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

117

Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning

8 code implementations • 21 Sep 2016 • Suyoun Kim, Takaaki Hori, Shinji Watanabe

Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments.

Multi-Task Learning Speech Recognition

10,142

Multichannel End-to-end Speech Recognition

no code implementations • ICML 2017 • Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey

The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology.

Language Modelling Speech Enhancement +2

Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

6 code implementations • 8 Jun 2017 • Takaaki Hori, Shinji Watanabe, Yu Zhang, William Chan

The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

1,159

Joint CTC/attention decoding for end-to-end speech recognition

1 code implementation • ACL 2017 • Takaaki Hori, Shinji Watanabe, John Hershey

End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as pronunciation dictionary, tokenization, and context-dependency trees, leading to a greatly simplified model-building process.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

10,142

Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

no code implementations • 21 Nov 2017 • Zhong Meng, Shinji Watanabe, John R. Hershey, Hakan Erdogan

Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients.

Robust Speech Recognition speech-recognition

Multi-Modal Data Augmentation for End-to-End ASR

no code implementations • 27 Mar 2018 • Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner, Shinji Watanabe

We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using symbolic input in addition to the traditional acoustic input.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline

no code implementations • 27 Mar 2018 • Szu-Jui Chen, Aswin Shanmugam Subramanian, Hainan Xu, Shinji Watanabe

This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge. It aims to promote the development of noisy ASR in speech processing communities by providing 1) a state-of-the-art yet simplified single system comparable to the complicated top systems in the challenge, and 2) a publicly available and reproducible recipe through the main repository of the Kaldi speech recognition toolkit.

Ranked #2 on Noisy Speech Recognition on CHiME real

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines

no code implementations • 28 Mar 2018 • Jon Barker, Shinji Watanabe, Emmanuel Vincent, Jan Trmal

The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing, and machine learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

ESPnet: End-to-End Speech Processing Toolkit

no code implementations • 30 Mar 2018 • Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai

This paper introduces a new open source platform for end-to-end speech processing named ESPnet.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Multi-Head Decoder for End-to-End Speech Recognition

no code implementations • 22 Apr 2018 • Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda

This paper presents a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model.

speech-recognition Speech Recognition

A Purely End-to-end System for Multi-speaker Speech Recognition

no code implementations • ACL 2018 • Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner.

speech-recognition Speech Recognition

Weakly Supervised Deep Recurrent Neural Networks for Basic Dance Step Generation

3 code implementations • 3 Jul 2018 • Nelson Yalta, Shinji Watanabe, Kazuhiro Nakadai, Tetsuya Ogata

However, applying DNNs to generate dance for a piece of music is challenging because 1) DNNs need to generate long sequences while mapping the music input, 2) the DNN needs to constrain the motion beat to the music, and 3) DNNs require a considerable amount of hand-crafted data.

Motion Estimation

Low-Resource Contextual Topic Identification on Speech

no code implementations • 17 Jul 2018 • Chunxi Liu, Matthew Wiesner, Shinji Watanabe, Craig Harman, Jan Trmal, Najim Dehak, Sanjeev Khudanpur

In topic identification (topic ID) on real-world unstructured audio, an audio instance of variable topic shifts is first broken into sequential segments, and each segment is independently classified.

General Classification Topic Classification +1

Back-Translation-Style Data Augmentation for End-to-End ASR

no code implementations • 28 Jul 2018 • Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki Hori, Ramon Astudillo, Kazuya Takeda

In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR), utilizing a large amount of text which is not paired with speech signals.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

End-to-end Speech Recognition with Word-based RNN Language Models

no code implementations • 8 Aug 2018 • Takaaki Hori, Jaejin Cho, Shinji Watanabe

This paper investigates the impact of word-based RNN language models (RNN-LMs) on the performance of end-to-end automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

End-to-End Multi-Lingual Multi-Speaker Speech Recognition

no code implementations • 27 Sep 2018 • Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

Several multi-lingual ASR systems were recently proposed based on a monolithic neural network architecture without language-dependent modules, showing that modeling of multiple languages is well within the capabilities of an end-to-end framework.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Phasebook and Friends: Leveraging Discrete Representations for Source Separation

no code implementations • 2 Oct 2018 • Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, John R. Hershey

Here, we propose "magbook", "phasebook", and "combook", three new types of layers based on discrete representations that can be used to estimate complex time-frequency masks.

Speaker Separation Speech Enhancement

Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling

no code implementations • 4 Oct 2018 • Jaejin Cho, Murali Karthick Baskar, Ruizhi Li, Matthew Wiesner, Sri Harish Mallidi, Nelson Yalta, Martin Karafiat, Shinji Watanabe, Takaaki Hori

In this work, we attempt to use data from 10 BABEL languages to build a multi-lingual seq2seq model as a prior model, and then port it to 4 other BABEL languages using a transfer learning approach.

Language Modelling Sequence-To-Sequence Speech Recognition +2

Cycle-consistency training for end-to-end speech recognition

no code implementations • 2 Nov 2018 • Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji Watanabe, Jonathan Le Roux

To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

End-to-End Monaural Multi-speaker ASR System without Pretraining

no code implementations • 5 Nov 2018 • Xuankai Chang, Yanmin Qian, Kai Yu, Shinji Watanabe

The experiments demonstrate that the proposed methods can improve the performance of the end-to-end model in separating the overlapping speech and recognizing the separated streams.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Transfer learning of language-independent end-to-end ASR with language model fusion

no code implementations • 6 Nov 2018 • Hirofumi Inaguma, Jaejin Cho, Murali Karthick Baskar, Tatsuya Kawahara, Shinji Watanabe

This work explores better adaptation methods to low-resource languages using an external language model (LM) under the framework of transfer learning.

Language Modelling Transfer Learning

Building Corpora for Single-Channel Speech Separation Across Multiple Domains

no code implementations • 6 Nov 2018 • Matthew Maciejewski, Gregory Sell, Leibny Paola Garcia-Perera, Shinji Watanabe, Sanjeev Khudanpur

To date, the bulk of research on single-channel speech separation has been conducted using clean, near-field, read speech, which is not representative of many modern applications.

Speech Separation

Analysis of Multilingual Sequence-to-Sequence speech recognition systems

no code implementations • 7 Nov 2018 • Martin Karafiát, Murali Karthick Baskar, Shinji Watanabe, Takaaki Hori, Matthew Wiesner, Jan "Honza" Černocký

This paper investigates the applications of various multilingual approaches developed in conventional hidden Markov model (HMM) systems to sequence-to-sequence (seq2seq) automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Promising Accurate Prefix Boosting for sequence-to-sequence ASR

no code implementations • 7 Nov 2018 • Murali Karthick Baskar, Lukáš Burget, Shinji Watanabe, Martin Karafiát, Takaaki Hori, Jan Honza Černocký

In this paper, we present promising accurate prefix boosting (PAPB), a discriminative training technique for attention based sequence-to-sequence (seq2seq) ASR.

CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments

no code implementations • 7 Nov 2018 • Nelson Yalta, Shinji Watanabe, Takaaki Hori, Kazuhiro Nakadai, Tetsuya Ogata

By employing a convolutional neural network (CNN)-based multichannel end-to-end speech recognition system, this study attempts to overcome the difficulties present in everyday environments.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling

no code implementations • 10 Nov 2018 • Hainan Xu, Shuoyang Ding, Shinji Watanabe

Most end-to-end speech recognition systems model text directly as a sequence of characters or sub-words.

Automatic Speech Recognition (ASR) speech-recognition

Multi-encoder multi-resolution framework for end-to-end speech recognition

no code implementations • 12 Nov 2018 • Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

In this work, we present a novel Multi-Encoder Multi-Resolution (MEMR) framework based on the joint CTC/Attention model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Vectorization of hypotheses and speech for faster beam search in encoder decoder-based speech recognition

no code implementations • 12 Nov 2018 • Hiroshi Seki, Takaaki Hori, Shinji Watanabe

In this paper, we propose a parallelism technique for beam search, which accelerates the search process by vectorizing multiple hypotheses to eliminate the for-loop program.

speech-recognition Speech Recognition

Stream attention-based multi-array end-to-end speech recognition

no code implementations • 12 Nov 2018 • Xiaofei Wang, Ruizhi Li, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

Automatic Speech Recognition (ASR) using multiple microphone arrays has achieved great success in far-field robustness.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Pretraining by Backtranslation for End-to-end ASR in Low-Resource Settings

no code implementations • 10 Dec 2018 • Matthew Wiesner, Adithya Renduchintala, Shinji Watanabe, Chunxi Liu, Najim Dehak, Sanjeev Khudanpur

Using transcribed speech from nearby languages gives a further 20-30% relative reduction in character error rate.

Data Augmentation

Massively Multilingual Adversarial Speech Recognition

no code implementations • NAACL 2019 • Oliver Adams, Matthew Wiesner, Shinji Watanabe, David Yarowsky

We report on adaptation of multilingual end-to-end speech recognition models trained on as many as 100 languages.

General Classification speech-recognition +1

An Investigation of End-to-End Multichannel Speech Recognition for Reverberant and Mismatch Conditions

no code implementations • 19 Apr 2019 • Aswin Shanmugam Subramanian, Xiaofei Wang, Shinji Watanabe, Toru Taniguchi, Dung Tran, Yuya Fujita

This report investigates the ability of E2E ASR to extend from standard close-talk to far-field applications by encompassing the entire multichannel speech enhancement and ASR components within the S2S model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text

no code implementations • 30 Apr 2019 • Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki Hori, Lukáš Burget, Jan Černocký

Such techniques derive training procedures and losses able to leverage unpaired speech and/or text data by combining ASR with Text-to-Speech (TTS) models.

Ranked #33 on Semi-Supervised Image Classification on ImageNet - 10% labeled data (Top 5 Accuracy metric)

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Multi-Stream End-to-End Speech Recognition

no code implementations • 17 Jun 2019 • Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Shinji Watanabe, Takaaki Hori, Hynek Hermansky

Two representative frameworks are proposed and discussed: the Multi-Encoder Multi-Resolution (MEM-Res) framework and the Multi-Encoder Multi-Array (MEM-Array) framework.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition

no code implementations • 26 Jun 2019 • Naoyuki Kanda, Shota Horiguchi, Ryoichi Takashima, Yusuke Fujita, Kenji Nagamatsu, Shinji Watanabe

In this paper, we propose a novel auxiliary loss function for target-speaker automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

End-to-End Neural Speaker Diarization with Permutation-Free Objectives

1 code implementation • 12 Sep 2019 • Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe

To realize such a model, we formulate the speaker diarization problem as a multi-label classification problem, and introduce a permutation-free objective function to directly minimize diarization errors without suffering from the speaker-label permutation problem.

Ranked #6 on Speaker Diarization on CALLHOME

Clustering Domain Adaptation +3

347

End-to-End Neural Speaker Diarization with Self-attention

2 code implementations • 13 Sep 2019 • Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

Our method even outperformed the state-of-the-art x-vector clustering-based method.

Ranked #2 on Speaker Diarization on CALLHOME

Clustering speaker-diarization +1

347

A Comparative Study on Transformer vs RNN in Speech Applications

1 code implementation • 13 Sep 2019 • Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang

Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS).

Ranked #12 on Speech Recognition on AISHELL-1

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

7,875

Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

no code implementations • 17 Sep 2019 • Naoyuki Kanda, Shota Horiguchi, Yusuke Fujita, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1% from that of TS-ASR given oracle speaker embeddings.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Espresso: A Fast End-to-end Neural Speech Recognition Toolkit

1 code implementation • 18 Sep 2019 • Yiming Wang, Tongfei Chen, Hainan Xu, Shuoyang Ding, Hang Lv, Yiwen Shao, Nanyun Peng, Lei Xie, Shinji Watanabe, Sanjeev Khudanpur

We present Espresso, an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq.

Ranked #1 on Speech Recognition on Hub5'00 CallHome

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

941

Multilingual End-to-End Speech Translation

1 code implementation • 1 Oct 2019 • Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

In this paper, we propose a simple yet effective framework for multilingual end-to-end speech translation (ST), in which speech utterances in source languages are directly translated to the desired target languages with a universal sequence-to-sequence architecture.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

7,875

MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

no code implementations • 15 Oct 2019 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition.

speech-recognition Speech Recognition +1

Transformer ASR with Contextual Block Processing

no code implementations • 16 Oct 2019 • Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

In this paper, we propose a new block processing method for the Transformer encoder by introducing a context-aware inheritance mechanism.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

A practical two-stage training strategy for multi-stream end-to-end speech recognition

no code implementations • 23 Oct 2019 • Ruizhi Li, Gregory Sell, Xiaofei Wang, Shinji Watanabe, Hynek Hermansky

The multi-stream paradigm of audio processing, in which several sources are simultaneously considered, has been an active research area for information fusion.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

3 code implementations • 24 Oct 2019 • Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, Xu Tan

Furthermore, the unified design enables the integration of ASR functions with TTS, e.g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

7,875

Towards Online End-to-end Transformer Automatic Speech Recognition

no code implementations • 25 Oct 2019 • Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

In this paper, we extend it towards an entire online E2E ASR system by introducing an online decoding process inspired by monotonic chunkwise attention (MoChA) into the Transformer decoder.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition

no code implementations • 10 Nov 2019 • Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak

In this paper, we study two different non-autoregressive transformer structures for automatic speech recognition (ASR): A-CMLM and A-FMLM.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement

no code implementations • 18 Nov 2019 • Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, Kevin Wilson, Desh Raj, Shinji Watanabe, Zhuo Chen, John R. Hershey

This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation.

Speaker Separation Speech Enhancement +3

End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection

no code implementations • 3 Feb 2020 • Takenori Yoshimura, Tomoki Hayashi, Kazuya Takeda, Shinji Watanabe

The proposed method is publicly available.

Action Detection Activity Detection +3

End-to-End Multi-speaker Speech Recognition with Transformer

no code implementations • 10 Feb 2020 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios.

speech-recognition Speech Recognition

Speaker Diarization with Region Proposal Network

1 code implementation • 14 Feb 2020 • Zili Huang, Shinji Watanabe, Yusuke Fujita, Paola Garcia, Yiwen Shao, Daniel Povey, Sanjeev Khudanpur

Speaker diarization is an important pre-processing step for many speech applications, and it aims to solve the "who spoke when" problem.

Region Proposal speaker-diarization +1

End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification

1 code implementation • 24 Feb 2020 • Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu

However, the clustering-based approach has a number of problems; i.e., (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting its speaker embedding models to real audio recordings with speaker overlaps.

Clustering General Classification +3

CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

no code implementations • 20 Apr 2020 • Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).

speaker-diarization Speaker Diarization +4

ESPnet-ST: All-in-One Speech Translation Toolkit

1 code implementation • ACL 2020 • Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Enrique Yalta Soplin, Tomoki Hayashi, Shinji Watanabe

We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

7,875

DiscreTalk: Text-to-Speech as a Machine Translation Problem

no code implementations • 12 May 2020 • Tomoki Hayashi, Shinji Watanabe

This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

no code implementations • 18 May 2020 • Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, Tetsunori Kobayashi

In this work, the Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC.

Audio and Speech Processing Sound

End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

3 code implementations • 20 May 2020 • Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu

End-to-end speaker diarization for an unknown number of speakers is addressed in this paper.

Clustering speaker-diarization +1

347

Insertion-Based Modeling for End-to-End Automatic Speech Recognition

no code implementations • 27 May 2020 • Yuya Fujita, Shinji Watanabe, Motoi Omachi, Xuankai Chang

One NAT model, mask-predict, has been applied to ASR, but the model needs some heuristics or an additional component to estimate the length of the output token sequence.

Audio and Speech Processing Sound

Neural Speaker Diarization with Speaker-Wise Chain Rule

1 code implementation • 2 Jun 2020 • Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Jing Shi, Kenji Nagamatsu

Speaker diarization is an essential step for processing multi-speaker audio.

speaker-diarization Speaker Diarization

347

Online End-to-End Neural Diarization with Speaker-Tracing Buffer

no code implementations • 4 Jun 2020 • Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu

This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND).

speaker-diarization Speaker Diarization

Speaker-Conditional Chain Model for Speech Separation and Extraction

no code implementations • 25 Jun 2020 • Jing Shi, Jiaming Xu, Yusuke Fujita, Shinji Watanabe, Bo Xu

With the predicted speaker information from the whole observation, our model helps solve the problems of conventional speech separation and speaker extraction for multi-round long recordings.

Audio and Speech Processing Sound

Streaming Transformer ASR with Blockwise Synchronous Inference

no code implementations • 25 Jun 2020 • Emiru Tsunoo, Yosuke Kashiwagi, Shinji Watanabe

In this paper, we extend block processing towards an entire streaming E2E ASR system without additional training, by introducing a blockwise synchronous decoding process inspired by a neural transducer into the Transformer decoder.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals

no code implementations • NeurIPS 2020 • Jing Shi, Xuankai Chang, Pengcheng Guo, Shinji Watanabe, Yusuke Fujita, Jiaming Xu, Bo Xu, Lei Xie

This model additionally has a simple and efficient stop criterion for the end of the transduction, making it able to infer the variable number of output sequences.

Ranked #3 on Speech Separation on WSJ0-4mix

speech-recognition Speech Recognition +1

Augmentation adversarial training for self-supervised speaker recognition

no code implementations • 23 Jul 2020 • Jaesung Huh, Hee Soo Heo, Jingu Kang, Shinji Watanabe, Joon Son Chung

Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general.

Contrastive Learning Speaker Recognition

The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS

3 code implementations • 6 Oct 2020 • Wen-Chin Huang, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda

This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

7,875

Learning Speaker Embedding from Text-to-Speech

1 code implementation • 21 Oct 2020 • Jaejin Cho, Piotr Zelasko, Jesus Villalba, Shinji Watanabe, Najim Dehak

TTS with speaker classification loss improved EER by 0.28% and 0.73% absolute over a model using only speaker classification loss on LibriTTS and VoxCeleb1, respectively.

Classification General Classification +2

Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder

no code implementations • 25 Oct 2020 • Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

Fast inference speed is an important goal towards real-world deployment of speech translation (ST) systems.

Translation

Improved Mask-CTC for Non-Autoregressive End-to-End ASR

no code implementations • 26 Oct 2020 • Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi

While Mask-CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (AR) systems.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization

no code implementations • 30 Oct 2020 • Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe, Meng Yu, Yong Xu, Shi-Xiong Zhang, Dong Yu

The advantages of D-ASR over existing methods are threefold: (1) it provides explicit speaker locations, (2) it improves the explainability factor, and (3) it achieves better ASR performance as the process is more streamlined.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

DOVER-Lap: A Method for Combining Overlap-aware Diarization Outputs

1 code implementation • 3 Nov 2020 • Desh Raj, Leibny Paola Garcia-Perera, Zili Huang, Shinji Watanabe, Daniel Povey, Andreas Stolcke, Sanjeev Khudanpur

Several advances have been made recently towards handling overlapping speech for speaker diarization.

Audio and Speech Processing Sound

Paper
Code

Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

no code implementations • 3 Nov 2020 • Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey

Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information

no code implementations • 15 Nov 2020 • Yen-Ju Lu, Chia-Yu Chang, Cheng Yu, Ching-Feng Liu, Jeih-weih Hung, Shinji Watanabe, Yu Tsao

Experimental results from speech denoising, speech dereverberation, and impaired speech enhancement tasks confirmed that contextual BPC information improves SE performance.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

Improving RNN Transducer With Target Speaker Extraction and Neural Uncertainty Estimation

no code implementations • 26 Nov 2020 • Jiatong Shi, Chunlei Zhang, Chao Weng, Shinji Watanabe, Meng Yu, Dong Yu

Target-speaker speech recognition aims to recognize target-speaker speech from noisy environments with background noise and interfering speakers.

Speech Enhancement Speech Extraction +1 Sound Audio and Speech Processing

Paper
Add Code

Continuous Speech Separation Using Speaker Inventory for Long Multi-talker Recording

no code implementations • 17 Dec 2020 • Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen

Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years.

Clustering Speech Separation

Paper
Add Code

Toward Streaming ASR with Non-Autoregressive Insertion-based Model

no code implementations • 18 Dec 2020 • Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi

We propose a system to concatenate audio segmentation and non-autoregressive ASR to realize high accuracy and low RTF ASR.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

End-to-End Speaker Diarization as Post-Processing

no code implementations • 18 Dec 2020 • Shota Horiguchi, Paola Garcia, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu

Clustering-based diarization methods partition frames into as many clusters as there are speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to one speaker.

Clustering Multi-Label Classification +2

Paper
Add Code

Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers

no code implementations • 21 Jan 2021 • Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Paola Garcia, Kenji Nagamatsu

We propose a streaming diarization method based on an end-to-end neural diarization (EEND) model, which handles flexible numbers of speakers and overlapping speech.

Speaker Diarization Sound Audio and Speech Processing

Paper
Add Code

Arabic Speech Recognition by End-to-End, Modular Systems and Human

1 code implementation • 21 Jan 2021 • Amir Hussein, Shinji Watanabe, Ahmed Ali

Recent advances in automatic speech recognition (ASR) have achieved accuracy levels comparable to human transcribers, which led researchers to debate if the machine has reached human performance.

Arabic Speech Recognition Automatic Speech Recognition +3

7,877

Paper
Code

Understanding the Tradeoffs in Client-side Privacy for Downstream Speech Tasks

2 code implementations • 22 Jan 2021 • Peter Wu, Paul Pu Liang, Jiatong Shi, Ruslan Salakhutdinov, Shinji Watanabe, Louis-Philippe Morency

As users increasingly rely on cloud-based computing services, it is important to ensure that uploaded speech data remains private.

Representation Learning speech-recognition +1

Paper
Code

A Review of Speaker Diarization: Recent Advances with Deep Learning

no code implementations • 24 Jan 2021 • Tae Jin Park, Naoyuki Kanda, Dimitrios Dimitriadis, Kyu J. Han, Shinji Watanabe, Shrikanth Narayanan

Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify "who spoke when".

Retrieval speaker-diarization +3

Paper
Add Code

Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec

no code implementations • 26 Jan 2021 • Jiatong Shi, Jonathan D. Amith, Rey Castillo García, Esteban Guadalupe Sierra, Kevin Duh, Shinji Watanabe

"Transcription bottlenecks", created by a shortage of effective human transcribers, are one of the main challenges to endangered language (EL) documentation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap

no code implementations • 2 Feb 2021 • Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge.

Clustering

Paper
Add Code

Intermediate Loss Regularization for CTC-based Speech Recognition

no code implementations • 5 Feb 2021 • Jaesong Lee, Shinji Watanabe

In addition, we propose to combine this intermediate CTC loss with stochastic depth training, and apply this combination to a recently proposed Conformer network.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Deep Learning based Multi-Source Localization with Source Splitting and its Effectiveness in Multi-Talker Speech Recognition

no code implementations • 16 Feb 2021 • Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe, Meng Yu, Dong Yu

In addition to using the prediction error as a metric for evaluating our localization model, we also establish its potency as a frontend with automatic speech recognition (ASR) as the downstream task.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Gaussian Kernelized Self-Attention for Long Sequence Data and Its Application to CTC-based Speech Recognition

no code implementations • 18 Feb 2021 • Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Dual-Path Modeling for Long Recording Speech Separation in Meetings

no code implementations • 23 Feb 2021 • Chenda Li, Zhuo Chen, Yi Luo, Cong Han, Tianyan Zhou, Keisuke Kinoshita, Marc Delcroix, Shinji Watanabe, Yanmin Qian

A transformer-based dual-path system is proposed, which integrates transformer layers for global modeling.

Speech Separation

Paper
Add Code

End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

no code implementations • 23 Feb 2021 • Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach, Yanmin Qian

Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions.

Action Detection Activity Detection +4

Paper
Add Code

Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec

no code implementations • EACL 2021 • Jiatong Shi, Jonathan D. Amith, Rey Castillo García, Esteban Guadalupe Sierra, Kevin Duh, Shinji Watanabe

"Transcription bottlenecks", created by a shortage of effective human transcribers (i.e., transcriber shortage), are one of the main challenges to endangered language (EL) documentation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

INTERSPEECH 2021 ConferencingSpeech Challenge: Towards Far-field Multi-Channel Speech Enhancement for Video Conferencing

1 code implementation • 2 Apr 2021 • Wei Rao, Yihui Fu, Yanxin Hu, Xin Xu, Yvkai Jv, Jiangyu Han, Zhongjie Jiang, Lei Xie, Yannan Wang, Shinji Watanabe, Zheng-Hua Tan, Hui Bu, Tao Yu, Shidong Shang

The ConferencingSpeech 2021 challenge is proposed to stimulate research on far-field multi-channel speech enhancement for video conferencing.

Speech Enhancement Task 2

Paper
Code

SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

1 code implementation • 5 Apr 2021 • Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko

In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models.

Ranked #3 on Speech Recognition on SPGISpeech

speech-recognition Speech Recognition

7,877

Paper
Code

Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation

no code implementations • NAACL 2021 • Hirofumi Inaguma, Tatsuya Kawahara, Shinji Watanabe

To leverage the full potential of the source language information, we propose backward SeqKD, SeqKD from a target-to-source backward NMT model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

EAT: Enhanced ASR-TTS for Self-supervised Speech Recognition

1 code implementation • 13 Apr 2021 • Murali Karthick Baskar, Lukáš Burget, Shinji Watanabe, Ramon Fernandez Astudillo, Jan "Honza" Černocký

Self-supervised ASR-TTS models suffer in out-of-domain data conditions.

Language Modelling speech-recognition +1

Paper
Code

Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks

no code implementations • NAACL 2021 • Siddharth Dalmia, Brian Yan, Vikas Raunak, Florian Metze, Shinji Watanabe

In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks.

speech-recognition Speech Recognition +1

Paper
Add Code

SUPERB: Speech processing Universal PERformance Benchmark

5 code implementations • 3 May 2021 • Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-Yi Lee

SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data.

Representation Learning Self-Supervised Learning

2,092

Paper
Code

End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings

no code implementations • 5 May 2021 • Soumi Maiti, Hakan Erdogan, Kevin Wilson, Scott Wisdom, Shinji Watanabe, John R. Hershey

We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.

Clustering Speaker Identification +1

Paper
Add Code

Self-Guided Curriculum Learning for Neural Machine Translation

no code implementations • ACL (IWSLT) 2021 • Lei Zhou, Liang Ding, Kevin Duh, Shinji Watanabe, Ryohei Sasano, Koichi Takeda

In the field of machine learning, the well-trained model is assumed to be able to recover the training labels, i.e., the synthetic labels predicted by the model should be as close to the ground-truth labels as possible.

Machine Translation NMT +2

Paper
Add Code

End-to-end ASR to jointly predict transcriptions and linguistic annotations

no code implementations • NAACL 2021 • Motoi Omachi, Yuya Fujita, Shinji Watanabe, Matthew Wiesner

We propose a Transformer-based sequence-to-sequence model for automatic speech recognition (ASR) capable of simultaneously transcribing and annotating audio with linguistic information such as phonemic transcripts or part-of-speech (POS) tags.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

Data Augmentation Methods for End-to-end Speech Recognition on Distant-Talk Scenarios

no code implementations • 7 Jun 2021 • Emiru Tsunoo, Kentaro Shibata, Chaitanya Narisetty, Yosuke Kashiwagi, Shinji Watanabe

Although end-to-end automatic speech recognition (E2E ASR) has achieved great performance in tasks that have numerous paired data, it is still challenging to make E2E ASR robust against noisy and low-resource conditions.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection

no code implementations • 8 Jun 2021 • Yuki Takashima, Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Paola García, Kenji Nagamatsu

In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND).

Clustering speaker-diarization +1

Paper
Add Code

Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization

no code implementations • 9 Jun 2021 • Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Paola García, Kenji Nagamatsu

To evaluate our proposed method, we conduct model adaptation experiments using labeled and unlabeled data.

Clustering Pseudo Label

Paper
Add Code

Leveraging Pre-trained Language Model for Speech Sentiment Analysis

no code implementations • 11 Jun 2021 • Suwon Shon, Pablo Brusco, Jing Pan, Kyu J. Han, Shinji Watanabe

In this paper, we explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

2 code implementations • 13 Jun 2021 • Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, Zhiyong Yan

This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training.

Ranked #1 on Speech Recognition on GigaSpeech

Sentence speech-recognition +1

600

Paper
Code

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

1 code implementation • 16 Jun 2021 • Pengcheng Guo, Xuankai Chang, Shinji Watanabe, Lei Xie

Moreover, by including data with variable numbers of speakers, our model can even outperform the PIT-Conformer AR model with only 1/7 of the latency, obtaining WERs of 19.9% and 34.3% on the WSJ0-2mix and WSJ0-3mix sets.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Code

Multi-mode Transformer Transducer with Stochastic Future Context

no code implementations • 17 Jun 2021 • Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

A Multi-mode ASR model can fulfill various latency requirements during inference: when a larger latency becomes acceptable, the model can process longer future context to achieve higher accuracy, and when the latency budget is not flexible, the model can depend less on future context yet still achieve reliable accuracy.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Layer Pruning on Demand with Intermediate CTC

no code implementations • 17 Jun 2021 • Jaesong Lee, Jingu Kang, Shinji Watanabe

Deploying an end-to-end automatic speech recognition (ASR) model on mobile/embedded devices is a challenging task, since the device's computational power and energy consumption requirements change dynamically in practice.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Encoder-Decoder Based Attractors for End-to-End Neural Diarization

no code implementations • 20 Jun 2021 • Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Paola Garcia

Diarization results are then estimated as dot products of the attractors and embeddings.

speaker-diarization Speaker Diarization

Paper
Add Code
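The dot-product step mentioned in the entry above can be illustrated with a small sketch. This is not the authors' implementation: the function name `diarize`, the sigmoid, the 0.5 threshold, and the toy vectors are all assumptions made for illustration. Each frame embedding is scored against each speaker attractor, and a frame may be active for several speakers at once, which is how overlapping speech can be represented.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw dot-product score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def diarize(embeddings, attractors, threshold=0.5):
    """Return a frames x speakers matrix of 0/1 speech activities.

    embeddings: one vector per frame (hypothetical toy values here).
    attractors: one vector per speaker.
    The sigmoid and threshold are illustrative assumptions.
    """
    activity = []
    for frame in embeddings:
        row = []
        for attractor in attractors:
            score = sum(f * a for f, a in zip(frame, attractor))
            row.append(1 if sigmoid(score) > threshold else 0)
        activity.append(row)
    return activity

# Three frames, two speakers; the middle frame overlaps both speakers.
embeddings = [[2.0, -2.0], [2.0, 2.0], [-2.0, 2.0]]
attractors = [[1.0, 0.0], [0.0, 1.0]]
result = diarize(embeddings, attractors)
```

Because every speaker gets an independent score per frame, no frame is forced into a single cluster.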

Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding

no code implementations • 29 Jun 2021 • Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan, Siddharth Dalmia, Florian Metze, Shinji Watanabe, Alan W Black

Our splits identify performance gaps up to 10% between end-to-end systems that were within 1% of each other on the original test sets.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

ESPnet-ST IWSLT 2021 Offline Speech Translation System

no code implementations • ACL (IWSLT) 2021 • Hirofumi Inaguma, Brian Yan, Siddharth Dalmia, Pengcheng Guo, Jiatong Shi, Kevin Duh, Shinji Watanabe

This year we made various efforts on training data, architecture, and audio segmentation.

Knowledge Distillation speech-recognition +2

Paper
Add Code

Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors

no code implementations • 4 Jul 2021 • Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yawen Xue, Yuki Takashima, Yohei Kawaguchi

This makes it possible to produce diarization results of a large number of speakers for the whole recording even if the number of output speakers for each subsequence is limited.

Clustering

Paper
Add Code

Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models

1 code implementation • 20 Jul 2021 • Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe

Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

7,875

Paper
Code

On Prosody Modeling for ASR+TTS based Voice Conversion

no code implementations • 20 Jul 2021 • Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda

In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Differentiable Allophone Graphs for Language-Universal Speech Recognition

1 code implementation • 24 Jul 2021 • Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe

These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen languages.

speech-recognition Speech Recognition

Paper
Code

A Study on Speech Enhancement Based on Diffusion Probabilistic Model

1 code implementation • 25 Jul 2021 • Yen-Ju Lu, Yu Tsao, Shinji Watanabe

Based on this property, we propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals.

Speech Enhancement

189

Paper
Code

Target-speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker

no code implementations • 7 Aug 2021 • Maokui He, Desh Raj, Zili Huang, Jun Du, Zhuo Chen, Shinji Watanabe

Target-speaker voice activity detection (TS-VAD) has recently shown promising results for speaker diarization on highly overlapped speech.

Action Detection Activity Detection +3

Paper
Add Code

Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring

no code implementations • 9 Sep 2021 • Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.

Language Modelling Translation

Paper
Add Code

Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates

1 code implementation • 27 Sep 2021 • Hirofumi Inaguma, Siddharth Dalmia, Brian Yan, Shinji Watanabe

We propose Fast-MD, a fast MD model that generates HI by non-autoregressive (NAR) decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

9,781

Paper
Code

An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition

no code implementations • 9 Oct 2021 • Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-Yi Lee, Shinji Watanabe

We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Multi-Channel End-to-End Neural Diarization with Distributed Microphones

no code implementations • 10 Oct 2021 • Shota Horiguchi, Yuki Takashima, Paola Garcia, Shinji Watanabe, Yohei Kawaguchi

With simulated and real-recorded datasets, we demonstrated that the proposed method outperformed conventional EEND when a multi-channel input was given while maintaining comparable performance with a single-channel input.

speaker-diarization Speaker Diarization

Paper
Add Code

A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

no code implementations • 11 Oct 2021 • Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe

Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly speeds up inference at the cost of an accuracy drop compared to autoregressive baselines.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition

no code implementations • 11 Oct 2021 • Jing Pan, Tao Lei, Kwangyoun Kim, Kyu Han, Shinji Watanabe

The Transformer architecture has been well adopted as a dominant architecture in most sequence transduction tasks including automatic speech recognition (ASR), since its attention mechanism excels in capturing long-range dependencies.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations

2 code implementations • 12 Oct 2021 • Wen-Chin Huang, Shu-wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe, Tomoki Toda

In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting.

Benchmarking Voice Conversion

2,092

Paper
Code

ESPnet2-TTS: Extending the Edge of TTS Research

1 code implementation • 15 Oct 2021 • Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe

This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit.

7,875

Paper
Code

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

no code implementations • 27 Oct 2021 • Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Yanmin Qian

The deep learning based time-domain models, e.g., Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement.

Speech Enhancement speech-recognition +1

Paper
Add Code

TorchAudio: Building Blocks for Audio and Speech Processing

2 code implementations • 28 Oct 2021 • Yao-Yuan Yang, Moto Hira, Zhaoheng Ni, Anjali Chourdia, Artyom Astafurov, Caroline Chen, Ching-Feng Yeh, Christian Puhrsch, David Pollack, Dmitriy Genzel, Donny Greenberg, Edward Z. Yang, Jason Lian, Jay Mahadeokar, Jeff Hwang, Ji Chen, Peter Goldsborough, Prabhat Roy, Sean Narenthiran, Shinji Watanabe, Soumith Chintala, Vincent Quenneville-Bélair, Yangyang Shi

This document describes version 0.10 of TorchAudio: building blocks for machine learning applications in the audio and speech processing domain.

BIG-bench Machine Learning valid

2,379

Paper
Code

Sequence Transduction with Graph-based Supervision

no code implementations • 1 Nov 2021 • Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

The recurrent neural network transducer (RNN-T) objective plays a major role in building today's best automatic speech recognition (ASR) systems for production.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity

1 code implementation • 2 Nov 2021 • Peter Wu, Jiatong Shi, Yifan Zhong, Shinji Watanabe, Alan W Black

We demonstrate the effectiveness of our approach in language family classification, speech recognition, and speech synthesis tasks.

Cross-Lingual Transfer speech-recognition +2

Paper
Code

Attention-based Multi-hypothesis Fusion for Speech Summarization

2 code implementations • 16 Nov 2021 • Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe

We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Code

ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet

2 code implementations • 29 Nov 2021 • Siddhant Arora, Siddharth Dalmia, Pavel Denisov, Xuankai Chang, Yushi Ueda, Yifan Peng, Yuekai Zhang, Sujay Kumar, Karthik Ganesan, Brian Yan, Ngoc Thang Vu, Alan W Black, Shinji Watanabe

However, there are few open source toolkits that can be used to generate reproducible results on different Spoken Language Understanding (SLU) benchmarks.

Spoken Language Understanding

7,875

Paper
Code

Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization

no code implementations • 29 Nov 2021 • Brian Yan, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe, Dong Yu

Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type.

speech-recognition Speech Recognition

Paper
Add Code

Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem

no code implementations • 17 Dec 2021 • Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu

Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of the speech separation/enhancement related tasks from regression to classification.

regression Speech Separation

Paper
Add Code

A Study of Transducer based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies

no code implementations • 14 Jan 2022 • Florian Boyer, Yusuke Shinohara, Takaaki Ishii, Hirofumi Inaguma, Shinji Watanabe

In this study, we present recent developments of models trained with the RNN-T loss in ESPnet.

Multi-Task Learning

Paper
Add Code

Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models

no code implementations • 25 Jan 2022 • Keqi Deng, Zehui Yang, Shinji Watanabe, Yosuke Higuchi, Gaofeng Cheng, Pengyuan Zhang

The proposed NAR model significantly surpasses previous NAR systems on the AISHELL-1 benchmark and shows potential for English tasks.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR

no code implementations • 25 Jan 2022 • Emiru Tsunoo, Chaitanya Narisetty, Michael Hentschel, Yosuke Kashiwagi, Shinji Watanabe

To this end, we propose a novel blockwise synchronous decoding algorithm with a hybrid approach that combines endpoint prediction and endpoint post-determination.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Joint Speech Recognition and Audio Captioning

no code implementations • 3 Feb 2022 • Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe

A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions.

AudioCaps Audio captioning +4

Paper
Add Code

Conditional Diffusion Probabilistic Model for Speech Enhancement

2 code implementations • 10 Feb 2022 • Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe, Alexander Richard, Cheng Yu, Yu Tsao

Speech enhancement is a critical component of many user-oriented audio applications, yet current systems still suffer from distorted and unnatural outputs.

Speech Enhancement Speech Synthesis

189

Paper
Code

Acoustic Event Detection with Classifier Chains

no code implementations • 17 Feb 2022 • Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi

In each iteration, the event's activity is estimated and used to condition the next output based on the probabilistic chain rule to form classifier chains.

Event Detection

Paper
Add Code
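The chain-rule conditioning described in the entry above can be sketched as a loop in which each event classifier sees the decisions already made for earlier events. This is only a toy illustration under stated assumptions: `classifier_chain`, `speech_model`, `music_model`, the feature names, and the hard 0.5 threshold are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical per-event scorer: receives frame features plus the activities
# already decided for earlier events, and returns a probability in [0, 1].
EventModel = Callable[[Dict[str, float], List[int]], float]

def classifier_chain(features: Dict[str, float],
                     models: List[EventModel],
                     threshold: float = 0.5) -> List[int]:
    """Decide events one at a time, conditioning each on earlier decisions."""
    decided: List[int] = []
    for model in models:
        prob = model(features, decided)
        decided.append(1 if prob > threshold else 0)
    return decided

def speech_model(features: Dict[str, float], decided: List[int]) -> float:
    # First link in the chain: depends on the features only.
    return features["speech_score"]

def music_model(features: Dict[str, float], decided: List[int]) -> float:
    # Toy dependency: music is judged less likely once speech is active.
    penalty = 0.4 if decided and decided[0] == 1 else 0.0
    return max(0.0, features["music_score"] - penalty)

frame = {"speech_score": 0.9, "music_score": 0.7}
chain_result = classifier_chain(frame, [speech_model, music_model])
```

With the same music score, the second decision flips depending on whether speech was judged active first, which is the dependency that independent per-event classifiers would miss.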

Towards Low-distortion Multi-channel Speech Enhancement: The ESPNet-SE Submission to The L3DAS22 Challenge

no code implementations • 24 Feb 2022 • Yen-Ju Lu, Samuele Cornell, Xuankai Chang, Wangyou Zhang, Chenda Li, Zhaoheng Ni, Zhong-Qiu Wang, Shinji Watanabe

This paper describes our submission to the L3DAS22 Challenge Task 1, which consists of speech enhancement with 3D Ambisonic microphones.

Speech Enhancement

Paper
Add Code

Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR

no code implementations • 1 Mar 2022 • Xuankai Chang, Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

HEAR: Holistic Evaluation of Audio Representations

3 code implementations • 6 Mar 2022 • Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk

The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios.

Open-Ended Question Answering

SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities

1 code implementation • ACL 2022 • Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-wen Yang, Shuyan Dong, Andy T. Liu, Cheng-I Jeff Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-Yi Lee

In this paper, we introduce SUPERB-SG, a new benchmark focused on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB.

Self-Supervised Learning Transfer Learning

2,092

Investigating self-supervised learning for speech enhancement and separation

no code implementations • 15 Mar 2022 • Zili Huang, Shinji Watanabe, Shu-wen Yang, Paola Garcia, Sanjeev Khudanpur

Speech enhancement and separation are two fundamental tasks for robust speech processing.

Self-Supervised Learning Speech Enhancement +1

Memory-Efficient Training of RNN-Transducer with Sampled Softmax

no code implementations • 31 Mar 2022 • Jaesong Lee, Lukas Lee, Shinji Watanabe

RNN-Transducer has been one of the promising architectures for end-to-end automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

1 code implementation • 31 Mar 2022 • Soumi Maiti, Yushi Ueda, Shinji Watanabe, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Yong Xu

In this paper, we present a novel framework that jointly performs three tasks: speaker diarization, speech separation, and speaker counting.

speaker-diarization Speaker Diarization +1

7,875

SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy

no code implementations • 31 Mar 2022 • Shuai Guo, Jiatong Shi, Tao Qian, Shinji Watanabe, Qin Jin

Deep learning based singing voice synthesis (SVS) systems have been shown to flexibly generate higher-quality singing than conventional statistical parametric methods.

Data Augmentation Singing Voice Synthesis

End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

no code implementations • 1 Apr 2022 • Xuankai Chang, Takashi Maekaku, Yuya Fujita, Shinji Watanabe

This work presents our end-to-end (E2E) automatic speech recognition (ASR) model targeting robust speech recognition, called Integrated speech Recognition with enhanced speech Input for Self-supervised learning representation (IRIS).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

End-to-End Multi-speaker ASR with Independent Vector Analysis

no code implementations • 1 Apr 2022 • Robin Scheibler, Wangyou Zhang, Xuankai Chang, Shinji Watanabe, Yanmin Qian

We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Better Intermediates Improve CTC Inference

no code implementations • 1 Apr 2022 • Tatsuya Komatsu, Yusuke Fujita, Jaesong Lee, Lukas Lee, Shinji Watanabe, Yusuke Kida

This paper proposes a method for improved CTC inference with searched intermediates and multi-pass conditioning.

Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation

1 code implementation • 5 Apr 2022 • Dan Berrebbi, Jiatong Shi, Brian Yan, Osbel Lopez-Francisco, Jonathan D. Amith, Shinji Watanabe

The present work examines the assumption that combining non-learnable SF extractors with SSL models is an effective approach to low resource speech tasks.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

7,875

Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation

no code implementations • 19 Apr 2022 • Keqi Deng, Shinji Watanabe, Jiatong Shi, Siddhant Arora

Although Transformers have gained success in several speech processing tasks like spoken language understanding (SLU) and speech translation (ST), achieving online processing while keeping competitive performance is still essential for real-world interaction.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

1 code implementation • 2 May 2022 • Felix Wu, Kwangyoun Kim, Shinji Watanabe, Kyu Han, Ryan Mcdonald, Kilian Q. Weinberger, Yoav Artzi

We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.

Ranked #3 on Named Entity Recognition (NER) on SLUE

Automatic Speech Recognition Automatic Speech Recognition (ASR) +6

Self-Supervised Speech Representation Learning: A Review

no code implementations • 21 May 2022 • Abdelrahman Mohamed, Hung-Yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors

no code implementations • 6 Jun 2022 • Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yuki Takashima, Yohei Kawaguchi

Finally, to improve online diarization, our method improves the buffer update method and revisits the variable chunk-size training of EEND.

Multi-Label Classification speaker-diarization +1

LegoNN: Building Modular Encoder-Decoder Models

no code implementations • 7 Jun 2022 • Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji Watanabe, Florian Metze, Luke Zettlemoyer, Abdelrahman Mohamed

We describe LegoNN, a procedure for building encoder-decoder architectures in a way so that its parts can be applied to other tasks without the need for any fine-tuning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Residual Language Model for End-to-end Speech Recognition

no code implementations • 15 Jun 2022 • Emiru Tsunoo, Yosuke Kashiwagi, Chaitanya Narisetty, Shinji Watanabe

In this paper, we propose a simple external LM fusion method for domain adaptation, which considers the internal LM estimation in its training.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models

no code implementations • 1 Jul 2022 • Yuki Takashima, Shota Horiguchi, Shinji Watanabe, Paola García, Yohei Kawaguchi

In this paper, we present an incremental domain adaptation technique to prevent catastrophic forgetting for an end-to-end automatic speech recognition (ASR) model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Improving Speech Enhancement through Fine-Grained Speech Characteristics

1 code implementation • 1 Jul 2022 • Muqiao Yang, Joseph Konan, David Bick, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

We first identify key acoustic parameters that have been found to correlate well with voice quality (e.g., jitter, shimmer, and spectral flux) and then propose objective functions which are aimed at reducing the difference between clean speech and enhanced speech with respect to these features.

Speech Enhancement

Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding

4 code implementations • 6 Jul 2022 • Yifan Peng, Siddharth Dalmia, Ian Lane, Shinji Watanabe

Conformer has proven to be effective in many speech processing tasks.

speech-recognition Speech Recognition +1

7,875

Online Continual Learning of End-to-End Speech Recognition Models

no code implementations • 11 Jul 2022 • Muqiao Yang, Ian Lane, Shinji Watanabe

Continual Learning, also known as Lifelong Learning, aims to continually learn from new data as it becomes available.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Two-Pass Low Latency End-to-End Spoken Language Understanding

no code implementations • 14 Jul 2022 • Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan Black, Shinji Watanabe

End-to-end (E2E) models are becoming increasingly popular for spoken language understanding (SLU) systems and are beginning to achieve performance competitive with pipeline-based approaches.

speech-recognition Speech Recognition +2

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

1 code implementation • 19 Jul 2022 • Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

To showcase such integration, we performed experiments on carefully designed synthetic datasets for noisy-reverberant multi-channel ST and SLU tasks, which can be used as benchmark corpora for future research.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

7,875

When Is TTS Augmentation Through a Pivot Language Useful?

1 code implementation • 20 Jul 2022 • Nathaniel Robinson, Perez Ogayo, Swetha Gangu, David R. Mortensen, Shinji Watanabe

Developing Automatic Speech Recognition (ASR) for low-resource languages is a challenge due to the small amount of transcribed audio data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

VQ-T: RNN Transducers using Vector-Quantized Prediction Network States

no code implementations • 3 Aug 2022 • Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury

Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses.

Language Modelling

ASR2K: Speech Recognition for Around 2000 Languages without Audio

1 code implementation • 6 Sep 2022 • Xinjian Li, Florian Metze, David R Mortensen, Alan W Black, Shinji Watanabe

We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only and improve them to 45% CER and 69% WER when using 10000 raw text utterances.

Language Modelling Speech Recognition

Deep Speech Synthesis from Articulatory Representations

1 code implementation • 13 Sep 2022 • Peter Wu, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

In the articulatory synthesis task, speech is synthesized from input features containing information about the physical behavior of the human vocal tract.

Speech Synthesis

ESPnet-ONNX: Bridging a Gap Between Research and Production

1 code implementation • 20 Sep 2022 • Masao Someki, Yosuke Higuchi, Tomoki Hayashi, Shinji Watanabe

In the field of deep learning, researchers often focus on inventing novel neural network models and improving benchmarks.

Spoken Language Understanding

145

E-Branchformer: Branchformer with Enhanced merging for speech recognition

1 code implementation • 30 Sep 2022 • Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR).

Ranked #9 on Speech Recognition on LibriSpeech test-other

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

7,875

Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization

no code implementations • 7 Oct 2022 • Shota Horiguchi, Yuki Takashima, Shinji Watanabe, Paola Garcia

This paper focuses on speaker diarization and proposes to conduct the above bi-directional knowledge transfer alternately.

Knowledge Distillation speaker-diarization +2

CTC Alignments Improve Autoregressive Translation

no code implementations • 11 Oct 2022 • Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

On Compressing Sequences for Self-Supervised Speech Models

no code implementations • 13 Oct 2022 • Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-Yi Lee, Hao Tang

Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference.

Self-Supervised Learning

Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

no code implementations • 14 Oct 2022 • Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target units.

SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning

no code implementations • 16 Oct 2022 • Tzu-hsun Feng, Annie Dong, Ching-Feng Yeh, Shu-wen Yang, Tzu-Quan Lin, Jiatong Shi, Kai-Wei Chang, Zili Huang, Haibin Wu, Xuankai Chang, Shinji Watanabe, Abdelrahman Mohamed, Shang-Wen Li, Hung-Yi Lee

We present the SUPERB challenge at SLT 2022, which aims at learning self-supervised speech representation for better performance, generalization, and efficiency.

Audio Generation Representation Learning +2

Large-scale learning of generalised representations for speaker recognition

no code implementations • 20 Oct 2022 • Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesong Lee, Hye-jin Shim, Youngki Kwon, Joon Son Chung, Shinji Watanabe

We also show that training with proposed large data configurations gives better performance.

Inductive Bias Speaker Recognition

In search of strong embedding extractors for speaker diarisation

no code implementations • 26 Oct 2022 • Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung

First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.

Data Augmentation Speaker Verification

Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models

1 code implementation • 27 Oct 2022 • Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W Black, Shinji Watanabe

End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation.

named-entity-recognition Named Entity Recognition +2

7,875

Articulatory Representation Learning Via Joint Factor Analysis and Neural Matrix Factorization

no code implementations • 29 Oct 2022 • Jiachen Lian, Alan W Black, Yijing Lu, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

In this work, we propose a novel articulatory representation decomposition algorithm that takes advantage of guided factor analysis to derive the articulatory-specific factors and factor scores.

Representation Learning

BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

no code implementations • 29 Oct 2022 • Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC).

Language Modelling speech-recognition +2

Avoid Overthinking in Self-Supervised Models for Speech Recognition

no code implementations • 1 Nov 2022 • Dan Berrebbi, Brian Yan, Shinji Watanabe

Although popular for classification tasks in vision and language, EE has seen less use for sequence-to-sequence speech recognition (ASR) tasks where outputs from early layers are often degenerate.

Self-Supervised Learning Sequence-To-Sequence Speech Recognition +1

Towards Zero-Shot Code-Switched Speech Recognition

no code implementations • 2 Nov 2022 • Brian Yan, Matthew Wiesner, Ondrej Klejch, Preethi Jyothi, Shinji Watanabe

In this work, we seek to build effective code-switched (CS) automatic speech recognition (ASR) systems under the zero-shot setting where no transcribed CS speech data is available for training.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss

1 code implementation • 2 Nov 2022 • Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

This paper presents InterMPL, a semi-supervised learning method of end-to-end automatic speech recognition (ASR) that performs pseudo-labeling (PL) with intermediate supervision.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder

no code implementations • 2 Nov 2022 • Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

One crucial factor that makes this integration challenging lies in the vocabulary mismatch; the vocabulary constructed for a pre-trained LM is generally too large for E2E-ASR training and is likely to have a mismatch against a target ASR domain.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Multi-blank Transducers for Speech Recognition

1 code implementation • 4 Nov 2022 • Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe, Boris Ginsburg

This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

10,062

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

no code implementations • 4 Nov 2022 • Yusuke Shinohara, Shinji Watanabe

In this paper, we propose a new training method to explicitly model and reduce the latency of sequence transducer models.

speech-recognition Speech Recognition

Bridging Speech and Textual Pre-trained Models with Unsupervised ASR

no code implementations • 6 Nov 2022 • Jiatong Shi, Chan-Jan Hsu, Holam Chung, Dongji Gao, Paola Garcia, Shinji Watanabe, Ann Lee, Hung-Yi Lee

To be specific, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges different modalities used in speech and textual pre-trained models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models for Spoken Language Understanding

no code implementations • 10 Nov 2022 • Yifan Peng, Siddhant Arora, Yosuke Higuchi, Yushi Ueda, Sujay Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe

Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +6

Align, Write, Re-order: Explainable End-to-End Speech Translation via Operation Sequence Generation

no code implementations • 11 Nov 2022 • Motoi Omachi, Brian Yan, Siddharth Dalmia, Yuya Fujita, Shinji Watanabe

To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units

1 code implementation • 12 Nov 2022 • Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

To address these issues, we devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.

Voice Conversion

Streaming Joint Speech Recognition and Disfluency Detection

1 code implementation • 16 Nov 2022 • Hayato Futami, Emiru Tsunoo, Kentaro Shibata, Yosuke Kashiwagi, Takao Okuda, Siddhant Arora, Shinji Watanabe

In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner.

Language Modelling speech-recognition +1

EURO: ESPnet Unsupervised ASR Open-source Toolkit

1 code implementation • 30 Nov 2022 • Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola Garcia, Hung-Yi Lee, Shinji Watanabe, Sanjeev Khudanpur

This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

7,875

SpeechLMScore: Evaluating speech generation using speech language model

2 code implementations • 8 Dec 2022 • Soumi Maiti, Yifan Peng, Takaaki Saeki, Shinji Watanabe

While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming.

Language Modelling Speech Enhancement +1

7,875

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

1 code implementation • 15 Dec 2022 • Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan Wang, Yu-An Chung, Yun Tang, Ann Lee, Shinji Watanabe, Juan Pino

We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization.

Denoising Speech-to-Speech Translation +3

29,251

Context-aware Fine-tuning of Self-supervised Speech Models

no code implementations • 16 Dec 2022 • Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu, Shinji Watanabe

During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

no code implementations • 20 Dec 2022 • Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape.

Dialog Act Classification Question Answering +4

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

no code implementations • 21 Dec 2022 • Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe

The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study

1 code implementation • 22 Jan 2023 • Massa Baali, Tomoki Hayashi, Hamdy Mubarak, Soumi Maiti, Shinji Watanabe, Wassim El-Hajj, Ahmed Ali

Several high-resource Text to Speech (TTS) systems currently produce natural, well-established human-like speech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

7,876

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

1 code implementation • 30 Jan 2023 • Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari

While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data.

Language Modelling

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

1 code implementation • 8 Feb 2023 • Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness.

Code Generation Speech Synthesis +1

226

Speaker-Independent Acoustic-to-Articulatory Speech Inversion

1 code implementation • 14 Feb 2023 • Peter Wu, Li-Wei Chen, Cheol Jun Cho, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

To build speech processing methods that can handle speech as naturally as humans, researchers have explored multiple ways of building an invertible mapping from speech to an interpretable space.

Resynthesis
