Search Results for author: Shinji Watanabe

Found 279 papers, 91 papers with code

Multi-Modal Data Augmentation for End-to-End ASR

no code implementations • 27 Mar 2018 • Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner, Shinji Watanabe

We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using \emph{symbolic} input in addition to the traditional acoustic input.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

A Purely End-to-end System for Multi-speaker Speech Recognition

no code implementations • ACL 2018 • Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner.

speech-recognition Speech Recognition

Paper
Add Code

Multi-Head Decoder for End-to-End Speech Recognition

no code implementations • 22 Apr 2018 • Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda

This paper presents a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model.

speech-recognition Speech Recognition

Paper
Add Code

ESPnet: End-to-End Speech Processing Toolkit

no code implementations • 30 Mar 2018 • Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai

This paper introduces a new open source platform for end-to-end speech processing named ESPnet.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines

no code implementations • 28 Mar 2018 • Jon Barker, Shinji Watanabe, Emmanuel Vincent, Jan Trmal

The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing , and machine learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline

no code implementations • 27 Mar 2018 • Szu-Jui Chen, Aswin Shanmugam Subramanian, Hainan Xu, Shinji Watanabe

This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge to promote the development of noisy ASR in speech processing communities by providing 1) state-of-the-art system with a simplified single system comparable to the complicated top systems in the challenge, 2) publicly available and reproducible recipe through the main repository in the Kaldi speech recognition toolkit.

Ranked #2 on Noisy Speech Recognition on CHiME real

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

no code implementations • 21 Nov 2017 • Zhong Meng, Shinji Watanabe, John R. Hershey, Hakan Erdogan

Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients.

Robust Speech Recognition speech-recognition

Paper
Add Code

Multichannel End-to-end Speech Recognition

no code implementations • ICML 2017 • Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey

The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology.

Language Modelling Speech Enhancement +2

Paper
Add Code

Low-Resource Contextual Topic Identification on Speech

no code implementations • 17 Jul 2018 • Chunxi Liu, Matthew Wiesner, Shinji Watanabe, Craig Harman, Jan Trmal, Najim Dehak, Sanjeev Khudanpur

In topic identification (topic ID) on real-world unstructured audio, an audio instance of variable topic shifts is first broken into sequential segments, and each segment is independently classified.

General Classification Topic Classification +1

Paper
Add Code

Back-Translation-Style Data Augmentation for End-to-End ASR

no code implementations • 28 Jul 2018 • Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki Hori, Ramon Astudillo, Kazuya Takeda

In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR), utilizing a large amount of text which is not paired with speech signals.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

End-to-end Speech Recognition with Word-based RNN Language Models

no code implementations • 8 Aug 2018 • Takaaki Hori, Jaejin Cho, Shinji Watanabe

This paper investigates the impact of word-based RNN language models (RNN-LMs) on the performance of end-to-end automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Phasebook and Friends: Leveraging Discrete Representations for Source Separation

no code implementations • 2 Oct 2018 • Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, John R. Hershey

Here, we propose "magbook", "phasebook", and "combook", three new types of layers based on discrete representations that can be used to estimate complex time-frequency masks.

Speaker Separation Speech Enhancement

Paper
Add Code

Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling

no code implementations • 4 Oct 2018 • Jaejin Cho, Murali Karthick Baskar, Ruizhi Li, Matthew Wiesner, Sri Harish Mallidi, Nelson Yalta, Martin Karafiat, Shinji Watanabe, Takaaki Hori

In this work, we attempt to use data from 10 BABEL languages to build a multi-lingual seq2seq model as a prior model, and then port them towards 4 other BABEL languages using transfer learning approach.

Language Modelling Sequence-To-Sequence Speech Recognition +2

Paper
Add Code

Cycle-consistency training for end-to-end speech recognition

no code implementations • 2 Nov 2018 • Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji Watanabe, Jonathan Le Roux

To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Transfer learning of language-independent end-to-end ASR with language model fusion

no code implementations • 6 Nov 2018 • Hirofumi Inaguma, Jaejin Cho, Murali Karthick Baskar, Tatsuya Kawahara, Shinji Watanabe

This work explores better adaptation methods to low-resource languages using an external language model (LM) under the framework of transfer learning.

Language Modelling Transfer Learning

Paper
Add Code

End-to-End Monaural Multi-speaker ASR System without Pretraining

no code implementations • 5 Nov 2018 • Xuankai Chang, Yanmin Qian, Kai Yu, Shinji Watanabe

The experiments demonstrate that the proposed methods can improve the performance of the end-to-end model in separating the overlapping speech and recognizing the separated streams.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Promising Accurate Prefix Boosting for sequence-to-sequence ASR

no code implementations • 7 Nov 2018 • Murali Karthick Baskar, Lukáš Burget, Shinji Watanabe, Martin Karafiát, Takaaki Hori, Jan Honza Černocký

In this paper, we present promising accurate prefix boosting (PAPB), a discriminative training technique for attention based sequence-to-sequence (seq2seq) ASR.

Paper
Add Code

CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments

no code implementations • 7 Nov 2018 • Nelson Yalta, Shinji Watanabe, Takaaki Hori, Kazuhiro Nakadai, Tetsuya OGATA

By employing a convolutional neural network (CNN)-based multichannel end-to-end speech recognition system, this study attempts to overcome the presents difficulties in everyday environments.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Building Corpora for Single-Channel Speech Separation Across Multiple Domains

no code implementations • 6 Nov 2018 • Matthew Maciejewski, Gregory Sell, Leibny Paola Garcia-Perera, Shinji Watanabe, Sanjeev Khudanpur

To date, the bulk of research on single-channel speech separation has been conducted using clean, near-field, read speech, which is not representative of many modern applications.

Speech Separation

Paper
Add Code

Analysis of Multilingual Sequence-to-Sequence speech recognition systems

no code implementations • 7 Nov 2018 • Martin Karafiát, Murali Karthick Baskar, Shinji Watanabe, Takaaki Hori, Matthew Wiesner, Jan "Honza'' Černocký

This paper investigates the applications of various multilingual approaches developed in conventional hidden Markov model (HMM) systems to sequence-to-sequence (seq2seq) automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Stream attention-based multi-array end-to-end speech recognition

no code implementations • 12 Nov 2018 • Xiaofei Wang, Ruizhi Li, Sri Harish Mallid, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

Automatic Speech Recognition (ASR) using multiple microphone arrays has achieved great success in the far-field robustness.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Multi-encoder multi-resolution framework for end-to-end speech recognition

no code implementations • 12 Nov 2018 • Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

In this work, we present a novel Multi-Encoder Multi-Resolution (MEMR) framework based on the joint CTC/Attention model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Vectorization of hypotheses and speech for faster beam search in encoder decoder-based speech recognition

no code implementations • 12 Nov 2018 • Hiroshi Seki, Takaaki Hori, Shinji Watanabe

In this paper, we propose a parallelism technique for beam search, which accelerates the search process by vectorizing multiple hypotheses to eliminate the for-loop program.

speech-recognition Speech Recognition

Paper
Add Code

Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling

no code implementations • 10 Nov 2018 • Hainan Xu, Shuoyang Ding, Shinji Watanabe

Most end-to-end speech recognition systems model text directly as a sequence of characters or sub-words.

Automatic Speech Recognition (ASR) speech-recognition

Paper
Add Code

Pretraining by Backtranslation for End-to-end ASR in Low-Resource Settings

no code implementations • 10 Dec 2018 • Matthew Wiesner, Adithya Renduchintala, Shinji Watanabe, Chunxi Liu, Najim Dehak, Sanjeev Khudanpur

Using transcribed speech from nearby languages gives a further 20-30% relative reduction in character error rate.

Data Augmentation

Paper
Add Code

Statistical Dialogue Management using Intention Dependency Graph

no code implementations • IJCNLP 2013 • Koichiro Yoshino, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

Dialogue Management Management

Paper
Add Code

Massively Multilingual Adversarial Speech Recognition

no code implementations • NAACL 2019 • Oliver Adams, Matthew Wiesner, Shinji Watanabe, David Yarowsky

We report on adaptation of multilingual end-to-end speech recognition models trained on as many as 100 languages.

General Classification speech-recognition +1

Paper
Add Code

An Investigation of End-to-End Multichannel Speech Recognition for Reverberant and Mismatch Conditions

no code implementations • 19 Apr 2019 • Aswin Shanmugam Subramanian, Xiaofei Wang, Shinji Watanabe, Toru Taniguchi, Dung Tran, Yuya Fujita

This report investigates the ability of E2E ASR from standard close-talk to far-field applications by encompassing entire multichannel speech enhancement and ASR components within the S2S model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text

no code implementations • 30 Apr 2019 • Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki Hori, Lukáš Burget, Jan Černocký

Such techniques derive training procedures and losses able to leverage unpaired speech and/or text data by combining ASR with Text-to-Speech (TTS) models.

Ranked #33 on Semi-Supervised Image Classification on ImageNet - 10% labeled data (Top 5 Accuracy metric)

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Multi-Stream End-to-End Speech Recognition

no code implementations • 17 Jun 2019 • Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Shinji Watanabe, Takaaki Hori, Hynek Hermansky

Two representative framework have been proposed and discussed, which are Multi-Encoder Multi-Resolution (MEM-Res) framework and Multi-Encoder Multi-Array (MEM-Array) framework, respectively.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition

no code implementations • 26 Jun 2019 • Naoyuki Kanda, Shota Horiguchi, Ryoichi Takashima, Yusuke Fujita, Kenji Nagamatsu, Shinji Watanabe

In this paper, we propose a novel auxiliary loss function for target-speaker automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

no code implementations • 17 Sep 2019 • Naoyuki Kanda, Shota Horiguchi, Yusuke Fujita, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2. 1 % from that of TS-ASR given oracle speaker embeddings.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

no code implementations • 15 Oct 2019 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition.

speech-recognition Speech Recognition +1

Paper
Add Code

Transformer ASR with Contextual Block Processing

no code implementations • 16 Oct 2019 • Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

In this paper, we propose a new block processing method for the Transformer encoder by introducing a context-aware inheritance mechanism.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

A practical two-stage training strategy for multi-stream end-to-end speech recognition

no code implementations • 23 Oct 2019 • Ruizhi Li, Gregory Sell, Xiaofei Wang, Shinji Watanabe, Hynek Hermansky

The multi-stream paradigm of audio processing, in which several sources are simultaneously considered, has been an active research area for information fusion.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Towards Online End-to-end Transformer Automatic Speech Recognition

no code implementations • 25 Oct 2019 • Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

In this paper, we extend it towards an entire online E2E ASR system by introducing an online decoding process inspired by monotonic chunkwise attention (MoChA) into the Transformer decoder.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition

no code implementations • 10 Nov 2019 • Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak

In this paper, we study two different non-autoregressive transformer structure for automatic speech recognition (ASR): A-CMLM and A-FMLM.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement

no code implementations • 18 Nov 2019 • Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, Kevin Wilson, Desh Raj, Shinji Watanabe, Zhuo Chen, John R. Hershey

This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation.

Speaker Separation Speech Enhancement +3

Paper
Add Code

End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection

no code implementations • 3 Feb 2020 • Takenori Yoshimura, Tomoki Hayashi, Kazuya Takeda, Shinji Watanabe

The proposed method is publicly available.

Action Detection Activity Detection +3

Paper
Add Code

End-to-End Multi-speaker Speech Recognition with Transformer

no code implementations • 10 Feb 2020 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios.

speech-recognition Speech Recognition

Paper
Add Code

CHiME-6 Challenge:Tackling Multispeaker Speech Recognition for Unsegmented Recordings

no code implementations • 20 Apr 2020 • Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).

speaker-diarization Speaker Diarization +4

Paper
Add Code

DiscreTalk: Text-to-Speech as a Machine Translation Problem

no code implementations • 12 May 2020 • Tomoki Hayashi, Shinji Watanabe

This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

no code implementations • 18 May 2020 • Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, Tetsunori Kobayashi

In this work, Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC.

Audio and Speech Processing Sound

Paper
Add Code

Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals

no code implementations • NeurIPS 2020 • Jing Shi, Xuankai Chang, Pengcheng Guo, Shinji Watanabe, Yusuke Fujita, Jiaming Xu, Bo Xu, Lei Xie

This model additionally has a simple and efficient stop criterion for the end of the transduction, making it able to infer the variable number of output sequences.

Ranked #3 on Speech Separation on WSJ0-4mix

speech-recognition Speech Recognition +1

Paper
Add Code

Speaker-Conditional Chain Model for Speech Separation and Extraction

no code implementations • 25 Jun 2020 • Jing Shi, Jiaming Xu, Yusuke Fujita, Shinji Watanabe, Bo Xu

With the predicted speaker information from whole observation, our model is helpful to solve the problem of conventional speech separation and speaker extraction for multi-round long recordings.

Audio and Speech Processing Sound

Paper
Add Code

Insertion-Based Modeling for End-to-End Automatic Speech Recognition

no code implementations • 27 May 2020 • Yuya Fujita, Shinji Watanabe, Motoi Omachi, Xuankai Chan

One NAT model, mask-predict, has been applied to ASR but the model needs some heuristics or additional component to estimate the length of the output token sequence.

Audio and Speech Processing Sound

Paper
Add Code

Augmentation adversarial training for self-supervised speaker recognition

no code implementations • 23 Jul 2020 • Jaesung Huh, Hee Soo Heo, Jingu Kang, Shinji Watanabe, Joon Son Chung

Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general.

Contrastive Learning Speaker Recognition

Paper
Add Code

Streaming Transformer ASR with Blockwise Synchronous Inference

no code implementations • 25 Jun 2020 • Emiru Tsunoo, Yosuke Kashiwagi, Shinji Watanabe

In this paper, we extend block processing towards an entire streaming E2E ASR system without additional training, by introducing a blockwise synchronous decoding process inspired by a neural transducer into the Transformer decoder.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Improved Mask-CTC for Non-Autoregressive End-to-End ASR

no code implementations • 26 Oct 2020 • Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi

While Mask-CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (AR) systems.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder

no code implementations • 25 Oct 2020 • Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

Fast inference speed is an important goal towards real-world deployment of speech translation (ST) systems.

Translation

Paper
Add Code

Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization

no code implementations • 30 Oct 2020 • Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe, Meng Yu, Yong Xu, Shi-Xiong Zhang, Dong Yu

The advantages of D-ASR over existing methods are threefold: (1) it provides explicit speaker locations, (2) it improves the explainability factor, and (3) it achieves better ASR performance as the process is more streamlined.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

no code implementations • 3 Nov 2020 • Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey

Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information

no code implementations • 15 Nov 2020 • Yen-Ju Lu, Chia-Yu Chang, Cheng Yu, Ching-Feng Liu, Jeih-weih Hung, Shinji Watanabe, Yu Tsao

Experimental results from speech denoising, speech dereverberation, and impaired speech enhancement tasks confirmed that contextual BPC information improves SE performance.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

Continuous Speech Separation Using Speaker Inventory for Long Multi-talker Recording

no code implementations • 17 Dec 2020 • Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen

Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years.

Clustering Speech Separation

Paper
Add Code

End-to-End Speaker Diarization as Post-Processing

no code implementations • 18 Dec 2020 • Shota Horiguchi, Paola Garcia, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu

Clustering-based diarization methods partition frames into clusters of the number of speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to one speaker.

Clustering Multi-Label Classification +2

Paper
Add Code

Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers

no code implementations • 21 Jan 2021 • Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Paola Garcia, Kenji Nagamatsu

We propose a streaming diarization method based on an end-to-end neural diarization (EEND) model, which handles flexible numbers of speakers and overlapping speech.

Speaker Diarization Sound Audio and Speech Processing

Paper
Add Code

A Review of Speaker Diarization: Recent Advances with Deep Learning

no code implementations • 24 Jan 2021 • Tae Jin Park, Naoyuki Kanda, Dimitrios Dimitriadis, Kyu J. Han, Shinji Watanabe, Shrikanth Narayanan

Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify "who spoke when".

Retrieval speaker-diarization +3

Paper
Add Code

Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec

no code implementations • 26 Jan 2021 • Jiatong Shi, Jonathan D. Amith, Rey Castillo García, Esteban Guadalupe Sierra, Kevin Duh, Shinji Watanabe

"Transcription bottlenecks", created by a shortage of effective human transcribers are one of the main challenges to endangered language (EL) documentation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap

no code implementations • 2 Feb 2021 • Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge.

Clustering

Paper
Add Code

Intermediate Loss Regularization for CTC-based Speech Recognition

no code implementations • 5 Feb 2021 • Jaesong Lee, Shinji Watanabe

In addition, we propose to combine this intermediate CTC loss with stochastic depth training, and apply this combination to a recently proposed Conformer network.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Improving RNN Transducer With Target Speaker Extraction and Neural Uncertainty Estimation

no code implementations • 26 Nov 2020 • Jiatong Shi, Chunlei Zhang, Chao Weng, Shinji Watanabe, Meng Yu, Dong Yu

Target-speaker speech recognition aims to recognize target-speaker speech from noisy environments with background noise and interfering speakers.

Speech Enhancement Speech Extraction +1 Sound Audio and Speech Processing

Paper
Add Code

Online End-to-End Neural Diarization with Speaker-Tracing Buffer

no code implementations • 4 Jun 2020 • Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu

This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND).

speaker-diarization Speaker Diarization

Paper
Add Code

Dual-Path Modeling for Long Recording Speech Separation in Meetings

no code implementations • 23 Feb 2021 • Chenda Li, Zhuo Chen, Yi Luo, Cong Han, Tianyan Zhou, Keisuke Kinoshita, Marc Delcroix, Shinji Watanabe, Yanmin Qian

A transformer-based dual-path system is proposed, which integrates transform layers for global modeling.

Speech Separation

Paper
Add Code

End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

no code implementations • 23 Feb 2021 • Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach, Yanmin Qian

Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions.

Action Detection Activity Detection +4

Paper
Add Code

Gaussian Kernelized Self-Attention for Long Sequence Data and Its Application to CTC-based Speech Recognition

no code implementations • 18 Feb 2021 • Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Deep Learning based Multi-Source Localization with Source Splitting and its Effectiveness in Multi-Talker Speech Recognition

no code implementations • 16 Feb 2021 • Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe, Meng Yu, Dong Yu

In addition to using the prediction error as a metric for evaluating our localization model, we also establish its potency as a frontend with automatic speech recognition (ASR) as the downstream task.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation

no code implementations • NAACL 2021 • Hirofumi Inaguma, Tatsuya Kawahara, Shinji Watanabe

To leverage the full potential of the source language information, we propose backward SeqKD, SeqKD from a target-to-source backward NMT model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks

no code implementations • NAACL 2021 • Siddharth Dalmia, Brian Yan, Vikas Raunak, Florian Metze, Shinji Watanabe

In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks.

speech-recognition Speech Recognition +1

Paper
Add Code

End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings

no code implementations • 5 May 2021 • Soumi Maiti, Hakan Erdogan, Kevin Wilson, Scott Wisdom, Shinji Watanabe, John R. Hershey

We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.

Clustering Speaker Identification +1

Paper
Add Code

Self-Guided Curriculum Learning for Neural Machine Translation

no code implementations • ACL (IWSLT) 2021 • Lei Zhou, Liang Ding, Kevin Duh, Shinji Watanabe, Ryohei Sasano, Koichi Takeda

In the field of machine learning, the well-trained model is assumed to be able to recover the training labels, i. e. the synthetic labels predicted by the model should be as close to the ground-truth labels as possible.

Machine Translation NMT +2

Paper
Add Code

Data Augmentation Methods for End-to-end Speech Recognition on Distant-Talk Scenarios

no code implementations • 7 Jun 2021 • Emiru Tsunoo, Kentaro Shibata, Chaitanya Narisetty, Yosuke Kashiwagi, Shinji Watanabe

Although end-to-end automatic speech recognition (E2E ASR) has achieved great performance in tasks that have numerous paired data, it is still challenging to make E2E ASR robust against noisy and low-resource conditions.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection

no code implementations • 8 Jun 2021 • Yuki Takashima, Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Paola García, Kenji Nagamatsu

In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND).

Clustering speaker-diarization +1

Paper
Add Code

Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization

no code implementations • 9 Jun 2021 • Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Paola García, Kenji Nagamatsu

To evaluate our proposed method, we conduct the experiments of model adaptation using labeled and unlabeled data.

Clustering Pseudo Label

Paper
Add Code

Leveraging Pre-trained Language Model for Speech Sentiment Analysis

no code implementations • 11 Jun 2021 • Suwon Shon, Pablo Brusco, Jing Pan, Kyu J. Han, Shinji Watanabe

In this paper, we explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

End-to-end ASR to jointly predict transcriptions and linguistic annotations

no code implementations • NAACL 2021 • Motoi Omachi, Yuya Fujita, Shinji Watanabe, Matthew Wiesner

We propose a Transformer-based sequence-to-sequence model for automatic speech recognition (ASR) capable of simultaneously transcribing and annotating audio with linguistic information such as phonemic transcripts or part-of-speech (POS) tags.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yol\'oxochitl Mixtec

no code implementations • EACL 2021 • Jiatong Shi, Jonathan D. Amith, Rey Castillo Garc{\'\i}a, Esteban Guadalupe Sierra, Kevin Duh, Shinji Watanabe

{``}Transcription bottlenecks{''}, created by a shortage of effective human transcribers (i. e., transcriber shortage), are one of the main challenges to endangered language (EL) documentation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Layer Pruning on Demand with Intermediate CTC

no code implementations • 17 Jun 2021 • Jaesong Lee, Jingu Kang, Shinji Watanabe

Deploying an end-to-end automatic speech recognition (ASR) model on mobile/embedded devices is a challenging task, since the device computational power and energy consumption requirements are dynamically changed in practice.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Multi-mode Transformer Transducer with Stochastic Future Context

no code implementations • 17 Jun 2021 • Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

A Multi-mode ASR model can fulfill various latency requirements during inference -- when a larger latency becomes acceptable, the model can process longer future context to achieve higher accuracy and when a latency budget is not flexible, the model can be less dependent on future context but still achieve reliable accuracy.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Encoder-Decoder Based Attractors for End-to-End Neural Diarization

no code implementations • 20 Jun 2021 • Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Paola Garcia

Diarization results are then estimated as dot products of the attractors and embeddings.

speaker-diarization Speaker Diarization

Paper
Add Code

Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding

no code implementations • 29 Jun 2021 • Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan, Siddharth Dalmia, Florian Metze, Shinji Watanabe, Alan W Black

Our splits identify performance gaps up to 10% between end-to-end systems that were within 1% of each other on the original test sets.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

ESPnet-ST IWSLT 2021 Offline Speech Translation System

no code implementations • ACL (IWSLT) 2021 • Hirofumi Inaguma, Brian Yan, Siddharth Dalmia, Pengcheng Guo, Jiatong Shi, Kevin Duh, Shinji Watanabe

This year we made various efforts on training data, architecture, and audio segmentation.

Knowledge Distillation speech-recognition +2

Paper
Add Code

Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors

no code implementations • 4 Jul 2021 • Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yawen Xue, Yuki Takashima, Yohei Kawaguchi

This makes it possible to produce diarization results of a large number of speakers for the whole recording even if the number of output speakers for each subsequence is limited.

Clustering

Paper
Add Code

Toward Streaming ASR with Non-Autoregressive Insertion-based Model

no code implementations • 18 Dec 2020 • Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi

We propose a system to concatenate audio segmentation and non-autoregressive ASR to realize high accuracy and low RTF ASR.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

On Prosody Modeling for ASR+TTS based Voice Conversion

no code implementations • 20 Jul 2021 • Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda

In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Target-speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker

no code implementations • 7 Aug 2021 • Maokui He, Desh Raj, Zili Huang, Jun Du, Zhuo Chen, Shinji Watanabe

Target-speaker voice activity detection (TS-VAD) has recently shown promising results for speaker diarization on highly overlapped speech.

Action Detection Activity Detection +3

Paper
Add Code

Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring

no code implementations • 9 Sep 2021 • Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.

Language Modelling Translation

Paper
Add Code

A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

no code implementations • 11 Oct 2021 • Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe

Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly reduces the inference speed at the cost of accuracy drop compared to autoregressive baselines.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Multi-Channel End-to-End Neural Diarization with Distributed Microphones

no code implementations • 10 Oct 2021 • Shota Horiguchi, Yuki Takashima, Paola Garcia, Shinji Watanabe, Yohei Kawaguchi

With simulated and real-recorded datasets, we demonstrated that the proposed method outperformed conventional EEND when a multi-channel input was given while maintaining comparable performance with a single-channel input.

speaker-diarization Speaker Diarization

Paper
Add Code

An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition

no code implementations • 9 Oct 2021 • Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-Yi Lee, Shinji Watanabe

We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition

no code implementations • 11 Oct 2021 • Jing Pan, Tao Lei, Kwangyoun Kim, Kyu Han, Shinji Watanabe

The Transformer architecture has been well adopted as a dominant architecture in most sequence transduction tasks including automatic speech recognition (ASR), since its attention mechanism excels in capturing long-range dependencies.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

no code implementations • 27 Oct 2021 • Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Yanmin Qian

The deep learning based time-domain models, e. g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement.

Speech Enhancement speech-recognition +1

Paper
Add Code

Sequence Transduction with Graph-based Supervision

no code implementations • 1 Nov 2021 • Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

The recurrent neural network transducer (RNN-T) objective plays a major role in building today's best automatic speech recognition (ASR) systems for production.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation

no code implementations • NAACL (AmericasNLP) 2021 • Jiatong Shi, Jonathan D. Amith, Xuankai Chang, Siddharth Dalmia, Brian Yan, Shinji Watanabe

Documentation of endangered languages (ELs) has become increasingly urgent as thousands of languages are on the verge of disappearing by the end of the 21st century.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

End-to-End Multi-Lingual Multi-Speaker Speech Recognition

no code implementations • 27 Sep 2018 • Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

Several multi-lingual ASR systems were recently proposed based on a monolithic neural network architecture without language-dependent modules, showing that modeling of multiple languages is well within the capabilities of an end-to-end framework.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization

no code implementations • 29 Nov 2021 • Brian Yan, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe, Dong Yu

Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type.

speech-recognition Speech Recognition

Paper
Add Code

Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem

no code implementations • 17 Dec 2021 • Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu

Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of the speech separation/enhancement related tasks from regression to classification.

regression Speech Separation

Paper
Add Code

A Study of Transducer based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies

no code implementations • 14 Jan 2022 • Florian Boyer, Yusuke Shinohara, Takaaki Ishii, Hirofumi Inaguma, Shinji Watanabe

In this study, we present recent developments of models trained with the RNN-T loss in ESPnet.

Multi-Task Learning

Paper
Add Code

Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models

no code implementations • 25 Jan 2022 • Keqi Deng, Zehui Yang, Shinji Watanabe, Yosuke Higuchi, Gaofeng Cheng, Pengyuan Zhang

The proposed NAR model significantly surpasses previous NAR systems on the AISHELL-1 benchmark and shows a potential for English tasks.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR

no code implementations • 25 Jan 2022 • Emiru Tsunoo, Chaitanya Narisetty, Michael Hentschel, Yosuke Kashiwagi, Shinji Watanabe

To this end, we propose a novel blockwise synchronous decoding algorithm with a hybrid approach that combines endpoint prediction and endpoint post-determination.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Joint Speech Recognition and Audio Captioning

no code implementations • 3 Feb 2022 • Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe

A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions.

AudioCaps Audio captioning +4

Paper
Add Code

Acoustic Event Detection with Classifier Chains

no code implementations • 17 Feb 2022 • Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi

In each iteration, the event's activity is estimated and used to condition the next output based on the probabilistic chain rule to form classifier chains.

Event Detection

Paper
Add Code

ESPnet How2 Speech Translation System for IWSLT 2019: Pre-training, Knowledge Distillation, and Going Deeper

no code implementations • EMNLP (IWSLT) 2019 • Hirofumi Inaguma, Shun Kiyono, Nelson Enrique Yalta Soplin, Jun Suzuki, Kevin Duh, Shinji Watanabe

In this year, we mainly build our systems based on Transformer architectures in all tasks and focus on the end-to-end speech translation (E2E-ST).

Knowledge Distillation NMT +1

Paper
Add Code

Towards Low-distortion Multi-channel Speech Enhancement: The ESPNet-SE Submission to The L3DAS22 Challenge

no code implementations • 24 Feb 2022 • Yen-Ju Lu, Samuele Cornell, Xuankai Chang, Wangyou Zhang, Chenda Li, Zhaoheng Ni, Zhong-Qiu Wang, Shinji Watanabe

This paper describes our submission to the L3DAS22 Challenge Task 1, which consists of speech enhancement with 3D Ambisonic microphones.

Speech Enhancement

Paper
Add Code

The JHU/KyotoU Speech Translation System for IWSLT 2018

no code implementations • IWSLT (EMNLP) 2018 • Hirofumi Inaguma, Xuan Zhang, Zhiqi Wang, Adithya Renduchintala, Shinji Watanabe, Kevin Duh

This paper describes the Johns Hopkins University (JHU) and Kyoto University submissions to the Speech Translation evaluation campaign at IWSLT2018.

Transfer Learning Translation

Paper
Add Code

Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR

no code implementations • 1 Mar 2022 • Xuankai Chang, Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Investigating self-supervised learning for speech enhancement and separation

no code implementations • 15 Mar 2022 • Zili Huang, Shinji Watanabe, Shu-wen Yang, Paola Garcia, Sanjeev Khudanpur

Speech enhancement and separation are two fundamental tasks for robust speech processing.

Self-Supervised Learning Speech Enhancement +1

Paper
Add Code

SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy

no code implementations • 31 Mar 2022 • Shuai Guo, Jiatong Shi, Tao Qian, Shinji Watanabe, Qin Jin

Deep learning based singing voice synthesis (SVS) systems have been demonstrated to flexibly generate singing with better qualities, compared to conventional statistical parametric based methods.

Data Augmentation Singing Voice Synthesis

Paper
Add Code

Memory-Efficient Training of RNN-Transducer with Sampled Softmax

no code implementations • 31 Mar 2022 • Jaesong Lee, Lukas Lee, Shinji Watanabe

RNN-Transducer has been one of promising architectures for end-to-end automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

no code implementations • 1 Apr 2022 • Xuankai Chang, Takashi Maekaku, Yuya Fujita, Shinji Watanabe

This work presents our end-to-end (E2E) automatic speech recognition (ASR) model targetting at robust speech recognition, called Integraded speech Recognition with enhanced speech Input for Self-supervised learning representation (IRIS).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

End-to-End Multi-speaker ASR with Independent Vector Analysis

no code implementations • 1 Apr 2022 • Robin Scheibler, Wangyou Zhang, Xuankai Chang, Shinji Watanabe, Yanmin Qian

We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Better Intermediates Improve CTC Inference

no code implementations • 1 Apr 2022 • Tatsuya Komatsu, Yusuke Fujita, Jaesong Lee, Lukas Lee, Shinji Watanabe, Yusuke Kida

This paper proposes a method for improved CTC inference with searched intermediates and multi-pass conditioning.

Paper
Add Code

Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation

no code implementations • 19 Apr 2022 • Keqi Deng, Shinji Watanabe, Jiatong Shi, Siddhant Arora

Although Transformers have gained success in several speech processing tasks like spoken language understanding (SLU) and speech translation (ST), achieving online processing while keeping competitive performance is still essential for real-world interaction.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Findings of the IWSLT 2022 Evaluation Campaign

no code implementations • IWSLT (ACL) 2022 • Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, Shinji Watanabe

The evaluation campaign of the 19th International Conference on Spoken Language Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Speech to speech translation, (iv) Low-resource speech translation, (v) Multilingual speech translation, (vi) Dialect speech translation, (vii) Formality control for speech translation, (viii) Isometric speech translation.

Speech-to-Speech Translation Translation

Paper
Add Code

CMU’s IWSLT 2022 Dialect Speech Translation System

no code implementations • IWSLT (ACL) 2022 • Brian Yan, Patrick Fernandes, Siddharth Dalmia, Jiatong Shi, Yifan Peng, Dan Berrebbi, Xinyi Wang, Graham Neubig, Shinji Watanabe

We use additional paired Modern Standard Arabic data (MSA) to directly improve the speech recognition (ASR) and machine translation (MT) components of our cascaded systems.

Knowledge Distillation Machine Translation +3

Paper
Add Code

Self-Supervised Speech Representation Learning: A Review

no code implementations • 21 May 2022 • Abdelrahman Mohamed, Hung-Yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors

no code implementations • 6 Jun 2022 • Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yuki Takashima, Yohei Kawaguchi

Finally, to improve online diarization, our method improves the buffer update method and revisits the variable chunk-size training of EEND.

Multi-Label Classification speaker-diarization +1

Paper
Add Code

LegoNN: Building Modular Encoder-Decoder Models

no code implementations • 7 Jun 2022 • Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji Watanabe, Florian Metze, Luke Zettlemoyer, Abdelrahman Mohamed

We describe LegoNN, a procedure for building encoder-decoder architectures in a way so that its parts can be applied to other tasks without the need for any fine-tuning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Residual Language Model for End-to-end Speech Recognition

no code implementations • 15 Jun 2022 • Emiru Tsunoo, Yosuke Kashiwagi, Chaitanya Narisetty, Shinji Watanabe

In this paper, we propose a simple external LM fusion method for domain adaptation, which considers the internal LM estimation in its training.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models

no code implementations • 1 Jul 2022 • Yuki Takashima, Shota Horiguchi, Shinji Watanabe, Paola García, Yohei Kawaguchi

In this paper, we present an incremental domain adaptation technique to prevent catastrophic forgetting for an end-to-end automatic speech recognition (ASR) model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Online Continual Learning of End-to-End Speech Recognition Models

no code implementations • 11 Jul 2022 • Muqiao Yang, Ian Lane, Shinji Watanabe

Continual Learning, also known as Lifelong Learning, aims to continually learn from new data as it becomes available.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Two-Pass Low Latency End-to-End Spoken Language Understanding

no code implementations • 14 Jul 2022 • Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan Black, Shinji Watanabe

End-to-end (E2E) models are becoming increasingly popular for spoken language understanding (SLU) systems and are beginning to achieve competitive performance to pipeline-based approaches.

speech-recognition Speech Recognition +2

Paper
Add Code

VQ-T: RNN Transducers using Vector-Quantized Prediction Network States

no code implementations • 3 Aug 2022 • Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury

Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses.

Language Modelling

Paper
Add Code

Phone Inventories and Recognition for Every Language

no code implementations • LREC 2022 • Xinjian Li, Florian Metze, David R. Mortensen, Alan W Black, Shinji Watanabe

Identifying phone inventories is a crucial component in language documentation and the preservation of endangered languages.

Paper
Add Code

Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization

no code implementations • 7 Oct 2022 • Shota Horiguchi, Yuki Takashima, Shinji Watanabe, Paola Garcia

This paper focuses on speaker diarization and proposes to conduct the above bi-directional knowledge transfer alternately.

Knowledge Distillation speaker-diarization +2

Paper
Add Code

CTC Alignments Improve Autoregressive Translation

no code implementations • 11 Oct 2022 • Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

On Compressing Sequences for Self-Supervised Speech Models

no code implementations • 13 Oct 2022 • Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-Yi Lee, Hao Tang

Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference.

Self-Supervised Learning

Paper
Add Code

Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

no code implementations • 14 Oct 2022 • Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target units.

Paper
Add Code

SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning

no code implementations • 16 Oct 2022 • Tzu-hsun Feng, Annie Dong, Ching-Feng Yeh, Shu-wen Yang, Tzu-Quan Lin, Jiatong Shi, Kai-Wei Chang, Zili Huang, Haibin Wu, Xuankai Chang, Shinji Watanabe, Abdelrahman Mohamed, Shang-Wen Li, Hung-Yi Lee

We present the SUPERB challenge at SLT 2022, which aims at learning self-supervised speech representation for better performance, generalization, and efficiency.

Audio Generation Representation Learning +2

Paper
Add Code

Large-scale learning of generalised representations for speaker recognition

no code implementations • 20 Oct 2022 • Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesong Lee, Hye-jin Shim, Youngki Kwon, Joon Son Chung, Shinji Watanabe

We also show that training with proposed large data configurations gives better performance.

Inductive Bias Speaker Recognition

Paper
Add Code

In search of strong embedding extractors for speaker diarisation

no code implementations • 26 Oct 2022 • Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung

First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.

Data Augmentation Speaker Verification

Paper
Add Code

BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

no code implementations • 29 Oct 2022 • Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC).

Language Modelling speech-recognition +2

Paper
Add Code

Articulatory Representation Learning Via Joint Factor Analysis and Neural Matrix Factorization

no code implementations • 29 Oct 2022 • Jiachen Lian, Alan W Black, Yijing Lu, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

In this work, we propose a novel articulatory representation decomposition algorithm that takes the advantage of guided factor analysis to derive the articulatory-specific factors and factor scores.

Representation Learning

Paper
Add Code

BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder

no code implementations • 2 Nov 2022 • Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

One crucial factor that makes this integration challenging lies in the vocabulary mismatch; the vocabulary constructed for a pre-trained LM is generally too large for E2E-ASR training and is likely to have a mismatch against a target ASR domain.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Towards Zero-Shot Code-Switched Speech Recognition

no code implementations • 2 Nov 2022 • Brian Yan, Matthew Wiesner, Ondrej Klejch, Preethi Jyothi, Shinji Watanabe

In this work, we seek to build effective code-switched (CS) automatic speech recognition systems (ASR) under the zero-shot setting where no transcribed CS speech data is available for training.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

no code implementations • 4 Nov 2022 • Yusuke Shinohara, Shinji Watanabe

In this paper, we propose a new training method to explicitly model and reduce the latency of sequence transducer models.

speech-recognition Speech Recognition

Paper
Add Code

Bridging Speech and Textual Pre-trained Models with Unsupervised ASR

no code implementations • 6 Nov 2022 • Jiatong Shi, Chan-Jan Hsu, Holam Chung, Dongji Gao, Paola Garcia, Shinji Watanabe, Ann Lee, Hung-Yi Lee

To be specific, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges different modalities used in speech and textual pre-trained models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Align, Write, Re-order: Explainable End-to-End Speech Translation via Operation Sequence Generation

no code implementations • 11 Nov 2022 • Motoi Omachi, Brian Yan, Siddharth Dalmia, Yuya Fujita, Shinji Watanabe

To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models for Spoken Language Understanding

no code implementations • 10 Nov 2022 • Yifan Peng, Siddhant Arora, Yosuke Higuchi, Yushi Ueda, Sujay Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe

Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +6

Paper
Add Code

Streaming Joint Speech Recognition and Disfluency Detection

1 code implementation • 16 Nov 2022 • Hayato Futami, Emiru Tsunoo, Kentaro Shibata, Yosuke Kashiwagi, Takao Okuda, Siddhant Arora, Shinji Watanabe

In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner.

Language Modelling speech-recognition +1

Paper
Code

Avoid Overthinking in Self-Supervised Models for Speech Recognition

no code implementations • 1 Nov 2022 • Dan Berrebbi, Brian Yan, Shinji Watanabe

Although popular for classification tasks in vision and language, EE has seen less use for sequence-to-sequence speech recognition (ASR) tasks where outputs from early layers are often degenerate.

Self-Supervised Learning Sequence-To-Sequence Speech Recognition +1

Paper
Add Code

Context-aware Fine-tuning of Self-supervised Speech Models

no code implementations • 16 Dec 2022 • Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu, Shinji Watanabe

During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

no code implementations • 20 Dec 2022 • Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape.

Dialog Act Classification Question Answering +4

Paper
Add Code

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

no code implementations • 21 Dec 2022 • Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe

The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Multi-Channel Target Speaker Extraction with Refinement: The WavLab Submission to the Second Clarity Enhancement Challenge

no code implementations • 15 Feb 2023 • Samuele Cornell, Zhong-Qiu Wang, Yoshiki Masuyama, Shinji Watanabe, Manuel Pariente, Nobutaka Ono

To address the challenges encountered in the CEC2 setting, we introduce four major novelties: (1) we extend the state-of-the-art TF-GridNet model, originally designed for monaural speaker separation, for multi-channel, causal speech enhancement, and large improvements are observed by replacing the TCNDenseNet used in iNeuBe with this new architecture; (2) we leverage a recent dual window size approach with future-frame prediction to ensure that iNueBe-X satisfies the 5 ms constraint on algorithmic latency required by CEC2; (3) we introduce a novel speaker-conditioning branch for TF-GridNet to achieve target speaker extraction; (4) we propose a fine-tuning step, where we compute an additional loss with respect to the target speaker signal compensated with the listener audiogram.

Speaker Separation Speech Enhancement +1

Paper
Add Code

End-to-End Speech Recognition: A Survey

no code implementations • 3 Mar 2023 • Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe

In the last decade of automatic speech recognition (ASR) research, the introduction of deep learning brought considerable reductions in word error rate of more than 50% relative, compared to modeling without deep learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

I3D: Transformer architectures with input-dependent dynamic depth for speech recognition

1 code implementation • 14 Mar 2023 • Yifan Peng, Jaesong Lee, Shinji Watanabe

Transformer-based end-to-end speech recognition has achieved great success.

Model Compression speech-recognition +1

Paper
Code

Enhancing Speech-to-Speech Translation with Multiple TTS Targets

no code implementations • 10 Apr 2023 • Jiatong Shi, Yun Tang, Ann Lee, Hirofumi Inaguma, Changhan Wang, Juan Pino, Shinji Watanabe

It has been known that direct speech-to-speech translation (S2ST) models usually suffer from the data scarcity issue because of the limited existing parallel materials for both source and target speech.

Speech-to-Speech Translation Speech-to-Text Translation +1

Paper
Add Code

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling

no code implementations • 18 Apr 2023 • Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

We propose FSB-LSTM, a novel long short-term memory (LSTM) based architecture that integrates full- and sub-band (FSB) modeling, for single- and multi-channel speech enhancement in the short-time Fourier transform (STFT) domain.

Speech Enhancement

Paper
Add Code

Joint Modelling of Spoken Language Understanding Tasks with Integrated Dialog History

no code implementations • 1 May 2023 • Siddhant Arora, Hayato Futami, Emiru Tsunoo, Brian Yan, Shinji Watanabe

Most human interactions occur in the form of spoken conversations where the semantic meaning of a given utterance depends on the context.

Spoken Language Understanding

Paper
Add Code

A Study on the Integration of Pipeline and E2E SLU systems for Spoken Semantic Parsing toward STOP Quality Challenge

no code implementations • 2 May 2023 • Siddhant Arora, Hayato Futami, Shih-Lun Wu, Jessica Huynh, Yifan Peng, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

Recently there have been efforts to introduce new benchmark tasks for spoken language understanding (SLU), like semantic parsing.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

The Pipeline System of ASR and NLU with MLM-based Data Augmentation toward STOP Low-resource Challenge

no code implementations • 2 May 2023 • Hayato Futami, Jessica Huynh, Siddhant Arora, Shih-Lun Wu, Yosuke Kashiwagi, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe

In the track, we adopt a pipeline approach of ASR and NLU.

Data Augmentation Domain Adaptation +2

Paper
Add Code

Improving Cascaded Unsupervised Speech Translation with Denoising Back-translation

no code implementations • 12 May 2023 • Yu-Kuan Fu, Liang-Hsuan Tseng, Jiatong Shi, Chen-An Li, Tsu-Yuan Hsu, Shinji Watanabe, Hung-Yi Lee

We use fully unpaired data to train our unsupervised systems and evaluate our results on CoVoST 2 and CVSS.

Denoising Machine Translation +1

Paper
Add Code

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

no code implementations • 18 May 2023 • Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.

Automatic Speech Recognition Language Identification +3

Paper
Add Code

Tensor decomposition for minimization of E2E SLU model toward on-device processing

no code implementations • 2 Jun 2023 • Yosuke Kashiwagi, Siddhant Arora, Hayato Futami, Jessica Huynh, Shih-Lun Wu, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe

We reduce the model size by applying tensor decomposition to the Conformer and E-Branchformer architectures used in our E2E SLU models.

speech-recognition Speech Recognition +2

Paper
Add Code

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute

no code implementations • 11 Jun 2023 • William Chen, Xuankai Chang, Yifan Peng, Zhaoheng Ni, Soumi Maiti, Shinji Watanabe

Our code and training optimizations make SSL feasible with only 8 GPUs, instead of the 32 used in the original work.

Self-Supervised Learning

Paper
Add Code

The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios

no code implementations • 23 Jun 2023 • Samuele Cornell, Matthew Wiesner, Shinji Watanabe, Desh Raj, Xuankai Chang, Paola Garcia, Matthew Maciejewski, Yoshiki Masuyama, Zhong-Qiu Wang, Stefano Squartini, Sanjeev Khudanpur

The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

BASS: Block-wise Adaptation for Speech Summarization

no code implementations • 17 Jul 2023 • Roshan Sharma, Kenneth Zheng, Siddhant Arora, Shinji Watanabe, Rita Singh, Bhiksha Raj

End-to-end speech summarization has been shown to improve performance over cascade baselines.

Paper
Add Code

Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding

no code implementations • 20 Jul 2023 • Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

There has been an increased interest in the integration of pretrained speech recognition (ASR) and language models (LM) into the SLU framework.

speech-recognition Speech Recognition +1

Paper
Add Code

Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

no code implementations • 23 Jul 2023 • Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, Samuele Cornell, Zhong-Qiu Wang, Nobutaka Ono, Yanmin Qian, Shinji Watanabe

In detail, we explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition

no code implementations • 24 Jul 2023 • Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Although frame-based models, such as CTC and transducers, have an affinity for streaming automatic speech recognition, their decoding uses no future knowledge, which could lead to incorrect pruning.

Automatic Speech Recognition speech-recognition +1

Paper
Add Code

Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

no code implementations • 15 Sep 2023 • Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention.

Language Identification speech-recognition +1

Paper
Add Code

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

no code implementations • 15 Sep 2023 • Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp.

Image Comprehension Language Modelling +1

Paper
Add Code

Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

no code implementations • 14 Sep 2023 • Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe

We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation.

Language Modelling speech-recognition +3

Paper
Add Code

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

no code implementations • 16 Sep 2023 • Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Because the decoder architecture is the same as an autoregressive LM, it is simple to enhance the model by leveraging external text data with LM training.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

no code implementations • 15 Sep 2023 • Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang, Hongbo Lan, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao

This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the ac-curacy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments.

Audio-Visual Speech Recognition speech-recognition +2

Paper
Add Code

Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

no code implementations • 20 Sep 2023 • Peter Polák, Brian Yan, Shinji Watanabe, Alex Waibel, Ondřej Bojar

Further, this method lacks mechanisms for \textit{controlling} the quality vs. latency tradeoff.

Translation

Paper
Add Code

Semi-Autoregressive Streaming ASR With Label Context

no code implementations • 19 Sep 2023 • Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury

Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

no code implementations • 26 Sep 2023 • Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe

Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

no code implementations • 26 Sep 2023 • William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe

We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials.

Denoising Self-Supervised Learning

Paper
Add Code

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

no code implementations • 27 Sep 2023 • Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe

Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation.

Machine Translation Speech-to-Text Translation +2

Paper
Add Code

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

no code implementations • 27 Sep 2023 • Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling.

Automatic Speech Recognition Self-Supervised Learning +3

Paper
Add Code

Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

no code implementations • 27 Sep 2023 • Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe, Sanjeev Khudanpur

Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied.

Machine Translation Translation

Paper
Add Code

Toward Universal Speech Enhancement for Diverse Input Conditions

no code implementations • 29 Sep 2023 • Wangyou Zhang, Kohei Saijo, Zhong-Qiu Wang, Shinji Watanabe, Yanmin Qian

Currently, there is no universal SE approach that can effectively handle diverse input conditions with a single model.

Denoising Speech Enhancement

Paper
Add Code

One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

no code implementations • 2 Oct 2023 • Samuele Cornell, Jee-weon Jung, Shinji Watanabe, Stefano Squartini

This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

no code implementations • 4 Oct 2023 • Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe

Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models.

Ranked #1 on Spoken Language Understanding on Fluent Speech Commands (using extra training data)

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model

no code implementations • 6 Oct 2023 • Takashi Maekaku, Jiatong Shi, Xuankai Chang, Yuya Fujita, Shinji Watanabe

In this paper, we propose a new approach to enrich the semantic representation of HuBERT.

Automatic Speech Recognition Representation Learning +3

Paper
Add Code

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

no code implementations • 9 Oct 2023 • Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification.

Language Identification speech-recognition +1

Paper
Add Code

A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction

no code implementations • 12 Oct 2023 • Kohei Saijo, Wangyou Zhang, Zhong-Qiu Wang, Shinji Watanabe, Tetsunori Kobayashi, Tetsuji Ogawa

We propose a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement (SE) tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting.

Denoising Speech Enhancement +2

Paper
Add Code

Generative Context-aware Fine-tuning of Self-supervised Speech Models

no code implementations • 15 Dec 2023 • Suwon Shon, Kwangyoun Kim, Prashant Sridhar, Yi-Te Hsu, Shinji Watanabe, Karen Livescu

Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text.

Automatic Speech Recognition named-entity-recognition +6

Paper
Add Code

Phoneme-aware Encoding for Prefix-tree-based Contextual ASR

no code implementations • 15 Dec 2023 • Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Hiroaki Ogawa, Siddhant Arora, Shinji Watanabe

While the original TCPGen relies on grapheme-based encoding, we propose extending it with phoneme-aware encoding to better recognize words of unusual pronunciations.

speech-recognition Speech Recognition

Paper
Add Code

Improving ASR Contextual Biasing with Guided Attention

no code implementations • 16 Jan 2024 • Jiyang Tang, Kwangyoun Kim, Suwon Shon, Felix Wu, Prashant Sridhar, Shinji Watanabe

Compared to studies with similar motivations, the proposed loss operates directly on the cross attention weights and is easier to implement.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

no code implementations • 19 Jan 2024 • Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, Shinji Watanabe

The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor

no code implementations • 23 Jan 2024 • Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhong-Qiu Wang, Shinji Watanabe

We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers.

Ranked #1 on Speech Separation on WSJ0-5mix

Speaker Separation Speech Separation

Paper
Add Code

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

no code implementations • 30 Jan 2024 • Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

In this work, we aim to improve the performance and efficiency of OWSM without extra training data.

Paper
Add Code

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition

no code implementations • 31 Jan 2024 • Yihan Wu, Soumi Maiti, Yifan Peng, Wangyou Zhang, Chenda Li, Yuyue Wang, Xihua Wang, Shinji Watanabe, Ruihua Song

Existing speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model.

Language Modelling Speech Enhancement +4

Paper
Add Code

Evaluating and Improving Continual Learning in Spoken Language Understanding

no code implementations • 16 Feb 2024 • Muqiao Yang, Xiang Li, Umberto Cappellazzo, Shinji Watanabe, Bhiksha Raj

In this work, we propose an evaluation methodology that provides a unified evaluation on stability, plasticity, and generalizability in continual learning.

Continual Learning Spoken Language Understanding

Paper
Add Code

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

no code implementations • 20 Feb 2024 • Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe

Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Aligning Speech to Languages to Enhance Code-switching Speech Recognition

no code implementations • 9 Mar 2024 • Hexin Liu, Xiangyu Zhang, Leibny Paola Garcia, Andy W. H. Khong, Eng Siong Chng, Shinji Watanabe

Performance evaluation using large language models reveals the advantage of the linguistic hint by achieving 14. 1% and 5. 5% relative improvement on test sets of the ASRU and SEAME datasets, respectively.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Wav2Gloss: Generating Interlinear Glossed Text from Speech

no code implementations • 19 Mar 2024 • Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel R. Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R. Mortensen, Lori Levin

Thousands of the world's languages are in danger of extinction--a tremendous threat to cultural identities and human language diversity.

Paper
Add Code

LV-CTC: Non-autoregressive ASR with CTC and latent variable models

no code implementations • 28 Mar 2024 • Yuya Fujita, Shinji Watanabe, Xuankai Chang, Takashi Maekaku

In this paper, we propose a new model combining CTC and a latent variable model, which is one of the state-of-the-art models in the neural machine translation research field.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity

1 code implementation • 2 Nov 2021 • Peter Wu, Jiatong Shi, Yifan Zhong, Shinji Watanabe, Alan W Black

We demonstrate the effectiveness of our approach in language family classification, speech recognition, and speech synthesis tasks.

Cross-Lingual Transfer speech-recognition +2

Paper
Code

EAT: Enhanced ASR-TTS for Self-supervised Speech Recognition

1 code implementation • 13 Apr 2021 • Murali Karthick Baskar, Lukáš Burget, Shinji Watanabe, Ramon Fernandez Astudillo, Jan "Honza'' Černocký

Self-supervised ASR-TTS models suffer in out-of-domain data conditions.

Language Modelling speech-recognition +1

Paper
Code

Differentiable Allophone Graphs for Language-Universal Speech Recognition

1 code implementation • 24 Jul 2021 • Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe

These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen language.

speech-recognition Speech Recognition

Paper
Code

When Is TTS Augmentation Through a Pivot Language Useful?

1 code implementation • 20 Jul 2022 • Nathaniel Robinson, Perez Ogayo, Swetha Gangu, David R. Mortensen, Shinji Watanabe

Developing Automatic Speech Recognition (ASR) for low-resource languages is a challenge due to the small amount of transcribed audio data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Code

AugSumm: towards generalizable speech summarization using synthetic labels from large language model

1 code implementation • 10 Jan 2024 • Jee-weon Jung, Roshan Sharma, William Chen, Bhiksha Raj, Shinji Watanabe

We tackle this challenge by proposing AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries for training and evaluation.

Language Modelling Large Language Model +1

Paper
Code

Understanding Probe Behaviors through Variational Bounds of Mutual Information

1 code implementation • 15 Dec 2023 • Kwanghee Choi, Jee-weon Jung, Shinji Watanabe

With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation.

Paper
Code

Learning Speaker Embedding from Text-to-Speech

1 code implementation • 21 Oct 2020 • Jaejin Cho, Piotr Zelasko, Jesus Villalba, Shinji Watanabe, Najim Dehak

TTS with speaker classification loss improved EER by 0. 28\% and 0. 73\% absolutely from a model using only speaker classification loss in LibriTTS and Voxceleb1 respectively.

Classification General Classification +2

Paper
Code

Understanding the Tradeoffs in Client-side Privacy for Downstream Speech Tasks

2 code implementations • 22 Jan 2021 • Peter Wu, Paul Pu Liang, Jiatong Shi, Ruslan Salakhutdinov, Shinji Watanabe, Louis-Philippe Morency

As users increasingly rely on cloud-based computing services, it is important to ensure that uploaded speech data remains private.

Representation Learning speech-recognition +1

Paper
Code

Attention-based Multi-hypothesis Fusion for Speech Summarization

2 code implementations • 16 Nov 2021 • Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe

We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Code

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

1 code implementation • 16 Jun 2021 • Pengcheng Guo, Xuankai Chang, Shinji Watanabe, Lei Xie

Moreover, by including the data of variable numbers of speakers, our model can even better than the PIT-Conformer AR model with only 1/7 latency, obtaining WERs of 19. 9% and 34. 3% on WSJ0-2mix and WSJ0-3mix sets.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.