Search Results for author: Shinji Watanabe

Found 278 papers, 90 papers with code

The JHU/KyotoU Speech Translation System for IWSLT 2018

no code implementations IWSLT (EMNLP) 2018 Hirofumi Inaguma, Xuan Zhang, Zhiqi Wang, Adithya Renduchintala, Shinji Watanabe, Kevin Duh

This paper describes the Johns Hopkins University (JHU) and Kyoto University submissions to the Speech Translation evaluation campaign at IWSLT2018.

Transfer Learning Translation

CMU’s IWSLT 2022 Dialect Speech Translation System

no code implementations IWSLT (ACL) 2022 Brian Yan, Patrick Fernandes, Siddharth Dalmia, Jiatong Shi, Yifan Peng, Dan Berrebbi, Xinyi Wang, Graham Neubig, Shinji Watanabe

We use additional paired Modern Standard Arabic data (MSA) to directly improve the speech recognition (ASR) and machine translation (MT) components of our cascaded systems.

Knowledge Distillation Machine Translation +3

Phone Inventories and Recognition for Every Language

no code implementations LREC 2022 Xinjian Li, Florian Metze, David R. Mortensen, Alan W Black, Shinji Watanabe

Identifying phone inventories is a crucial component in language documentation and the preservation of endangered languages.

Findings of the IWSLT 2022 Evaluation Campaign

no code implementations IWSLT (ACL) 2022 Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, Shinji Watanabe

The evaluation campaign of the 19th International Conference on Spoken Language Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Speech to speech translation, (iv) Low-resource speech translation, (v) Multilingual speech translation, (vi) Dialect speech translation, (vii) Formality control for speech translation, (viii) Isometric speech translation.

Speech-to-Speech Translation Translation

Self-supervised Representation Learning for Speech Processing

1 code implementation NAACL (ACL) 2022 Hung-Yi Lee, Abdelrahman Mohamed, Shinji Watanabe, Tara Sainath, Karen Livescu, Shang-Wen Li, Shu-wen Yang, Katrin Kirchhoff

Due to the growing popularity of SSL, and the shared mission of the areas in bringing speech and language technologies to more use cases with better quality and scaling the technologies for under-represented languages, we propose this tutorial to systematically survey the latest SSL techniques, tools, datasets, and performance achievement in speech processing.

Representation Learning

LV-CTC: Non-autoregressive ASR with CTC and latent variable models

no code implementations28 Mar 2024 Yuya Fujita, Shinji Watanabe, Xuankai Chang, Takashi Maekaku

In this paper, we propose a new model combining CTC and a latent variable model, which is one of the state-of-the-art models in the neural machine translation research field.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Wav2Gloss: Generating Interlinear Glossed Text from Speech

no code implementations19 Mar 2024 Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel R. Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R. Mortensen, Lori Levin

Thousands of the world's languages are in danger of extinction--a tremendous threat to cultural identities and human language diversity.

Aligning Speech to Languages to Enhance Code-switching Speech Recognition

no code implementations9 Mar 2024 Hexin Liu, Xiangyu Zhang, Leibny Paola Garcia, Andy W. H. Khong, Eng Siong Chng, Shinji Watanabe

Performance evaluation using large language models reveals the advantage of the linguistic hint by achieving 14. 1% and 5. 5% relative improvement on test sets of the ASRU and SEAME datasets, respectively.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

no code implementations20 Feb 2024 Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe

Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Evaluating and Improving Continual Learning in Spoken Language Understanding

no code implementations16 Feb 2024 Muqiao Yang, Xiang Li, Umberto Cappellazzo, Shinji Watanabe, Bhiksha Raj

In this work, we propose an evaluation methodology that provides a unified evaluation on stability, plasticity, and generalizability in continual learning.

Continual Learning Spoken Language Understanding

Improving Design of Input Condition Invariant Speech Enhancement

1 code implementation25 Jan 2024 Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian

In this paper we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while real condition degradation is much mitigated.

Speech Enhancement

Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

no code implementations19 Jan 2024 Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, Shinji Watanabe

The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Improving ASR Contextual Biasing with Guided Attention

no code implementations16 Jan 2024 Jiyang Tang, Kwangyoun Kim, Suwon Shon, Felix Wu, Prashant Sridhar, Shinji Watanabe

Compared to studies with similar motivations, the proposed loss operates directly on the cross attention weights and is easier to implement.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

AugSumm: towards generalizable speech summarization using synthetic labels from large language model

1 code implementation10 Jan 2024 Jee-weon Jung, Roshan Sharma, William Chen, Bhiksha Raj, Shinji Watanabe

We tackle this challenge by proposing AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries for training and evaluation.

Language Modelling Large Language Model +1

Phoneme-aware Encoding for Prefix-tree-based Contextual ASR

no code implementations15 Dec 2023 Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Hiroaki Ogawa, Siddhant Arora, Shinji Watanabe

While the original TCPGen relies on grapheme-based encoding, we propose extending it with phoneme-aware encoding to better recognize words of unusual pronunciations.

speech-recognition Speech Recognition

Generative Context-aware Fine-tuning of Self-supervised Speech Models

no code implementations15 Dec 2023 Suwon Shon, Kwangyoun Kim, Prashant Sridhar, Yi-Te Hsu, Shinji Watanabe, Karen Livescu

Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text.

Automatic Speech Recognition named-entity-recognition +6

Understanding Probe Behaviors through Variational Bounds of Mutual Information

1 code implementation15 Dec 2023 Kwanghee Choi, Jee-weon Jung, Shinji Watanabe

With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation.

A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction

no code implementations12 Oct 2023 Kohei Saijo, Wangyou Zhang, Zhong-Qiu Wang, Shinji Watanabe, Tetsunori Kobayashi, Tetsuji Ogawa

We propose a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement (SE) tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting.

Denoising Speech Enhancement +2

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

no code implementations9 Oct 2023 Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification.

Language Identification speech-recognition +1

UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

no code implementations4 Oct 2023 Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe

Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models.

 Ranked #1 on Spoken Language Understanding on Fluent Speech Commands (using extra training data)

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

no code implementations2 Oct 2023 Samuele Cornell, Jee-weon Jung, Shinji Watanabe, Stefano Squartini

This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Toward Universal Speech Enhancement for Diverse Input Conditions

no code implementations29 Sep 2023 Wangyou Zhang, Kohei Saijo, Zhong-Qiu Wang, Shinji Watanabe, Yanmin Qian

Currently, there is no universal SE approach that can effectively handle diverse input conditions with a single model.

Denoising Speech Enhancement

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

no code implementations27 Sep 2023 Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe

Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation.

Machine Translation Speech-to-Text Translation +2

Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

no code implementations27 Sep 2023 Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe, Sanjeev Khudanpur

Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied.

Machine Translation Translation

Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

no code implementations26 Sep 2023 Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe

Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

no code implementations26 Sep 2023 William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe

We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials.

Denoising Self-Supervised Learning

Semi-Autoregressive Streaming ASR With Label Context

no code implementations19 Sep 2023 Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury

Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

1 code implementation18 Sep 2023 Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-Yi Lee

To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark.

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

no code implementations16 Sep 2023 Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Because the decoder architecture is the same as an autoregressive LM, it is simple to enhance the model by leveraging external text data with LM training.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

no code implementations15 Sep 2023 Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang, Hongbo Lan, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao

This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the ac-curacy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments.

Audio-Visual Speech Recognition speech-recognition +2

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

no code implementations15 Sep 2023 Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp.

Image Comprehension Language Modelling +1

Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

no code implementations15 Sep 2023 Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention.

Language Identification speech-recognition +1

Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

no code implementations14 Sep 2023 Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe

We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation.

Language Modelling speech-recognition +3

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

1 code implementation19 Aug 2023 Jinchuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu, Shinji Watanabe

While the vanilla transducer does not have a prior preference for any of the valid paths, this work intends to enforce the preferred paths and achieve controllable alignment prediction.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition

no code implementations24 Jul 2023 Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Although frame-based models, such as CTC and transducers, have an affinity for streaming automatic speech recognition, their decoding uses no future knowledge, which could lead to incorrect pruning.

Automatic Speech Recognition speech-recognition +1

Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding

no code implementations20 Jul 2023 Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

There has been an increased interest in the integration of pretrained speech recognition (ASR) and language models (LM) into the SLU framework.

speech-recognition Speech Recognition +1

BASS: Block-wise Adaptation for Speech Summarization

no code implementations17 Jul 2023 Roshan Sharma, Kenneth Zheng, Siddhant Arora, Shinji Watanabe, Rita Singh, Bhiksha Raj

End-to-end speech summarization has been shown to improve performance over cascade baselines.

Deep Speech Synthesis from MRI-Based Articulatory Representations

1 code implementation5 Jul 2023 Peter Wu, Tingle Li, Yijing Lu, Yubin Zhang, Jiachen Lian, Alan W Black, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

Finally, through a series of ablations, we show that the proposed MRI representation is more comprehensive than EMA and identify the most suitable MRI feature subset for articulatory synthesis.

Computational Efficiency Denoising +1

A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning

2 code implementations19 May 2023 Jiyang Tang, William Chen, Xuankai Chang, Shinji Watanabe, Brian MacWhinney

Our system achieves state-of-the-art speaker-level detection accuracy (97. 3%), and a relative WER reduction of 11% for moderate Aphasia patients.

Multi-Task Learning speech-recognition +1

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

1 code implementation18 May 2023 Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath

We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.

Audio-Visual Speech Recognition Prompt Engineering +2

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

no code implementations18 May 2023 Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.

Automatic Speech Recognition Language Identification +3

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

2 code implementations18 May 2023 Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe

Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance in various tasks, including automatic speech recognition (ASR), speech translation (ST) and spoken language understanding (SLU).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Joint Modelling of Spoken Language Understanding Tasks with Integrated Dialog History

no code implementations1 May 2023 Siddhant Arora, Hayato Futami, Emiru Tsunoo, Brian Yan, Shinji Watanabe

Most human interactions occur in the form of spoken conversations where the semantic meaning of a given utterance depends on the context.

Spoken Language Understanding

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

1 code implementation25 Apr 2023 Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i. e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue.

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling

no code implementations18 Apr 2023 Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

We propose FSB-LSTM, a novel long short-term memory (LSTM) based architecture that integrates full- and sub-band (FSB) modeling, for single- and multi-channel speech enhancement in the short-time Fourier transform (STFT) domain.

Speech Enhancement

Enhancing Speech-to-Speech Translation with Multiple TTS Targets

no code implementations10 Apr 2023 Jiatong Shi, Yun Tang, Ann Lee, Hirofumi Inaguma, Changhan Wang, Juan Pino, Shinji Watanabe

It has been known that direct speech-to-speech translation (S2ST) models usually suffer from the data scarcity issue because of the limited existing parallel materials for both source and target speech.

Speech-to-Speech Translation Speech-to-Text Translation +1

End-to-End Speech Recognition: A Survey

no code implementations3 Mar 2023 Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe

In the last decade of automatic speech recognition (ASR) research, the introduction of deep learning brought considerable reductions in word error rate of more than 50% relative, compared to modeling without deep learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding

1 code implementation27 Feb 2023 Yifan Peng, Kwangyoun Kim, Felix Wu, Prashant Sridhar, Shinji Watanabe

Self-supervised speech representation learning (SSL) has shown to be effective in various downstream tasks, but SSL models are usually large and slow.

Model Compression Representation Learning +2

Improving Massively Multilingual ASR With Auxiliary CTC Objectives

1 code implementation24 Feb 2023 William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, Shinji Watanabe

In this paper, we introduce our work on improving performance on FLEURS, a 102-language open ASR benchmark, by conditioning the entire model on language identity (LID).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement

2 code implementations16 Feb 2023 Muqiao Yang, Joseph Konan, David Bick, Yunyang Zeng, Shuo Han, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

We can add this criterion as an auxiliary loss to any model that produces speech, to optimize speech outputs to match the values of clean speech in these features.

Speech Enhancement Time Series +1

Multi-Channel Target Speaker Extraction with Refinement: The WavLab Submission to the Second Clarity Enhancement Challenge

no code implementations15 Feb 2023 Samuele Cornell, Zhong-Qiu Wang, Yoshiki Masuyama, Shinji Watanabe, Manuel Pariente, Nobutaka Ono

To address the challenges encountered in the CEC2 setting, we introduce four major novelties: (1) we extend the state-of-the-art TF-GridNet model, originally designed for monaural speaker separation, for multi-channel, causal speech enhancement, and large improvements are observed by replacing the TCNDenseNet used in iNeuBe with this new architecture; (2) we leverage a recent dual window size approach with future-frame prediction to ensure that iNueBe-X satisfies the 5 ms constraint on algorithmic latency required by CEC2; (3) we introduce a novel speaker-conditioning branch for TF-GridNet to achieve target speaker extraction; (4) we propose a fine-tuning step, where we compute an additional loss with respect to the target speaker signal compensated with the listener audiogram.

Speaker Separation Speech Enhancement +1

Speaker-Independent Acoustic-to-Articulatory Speech Inversion

1 code implementation14 Feb 2023 Peter Wu, Li-Wei Chen, Cheol Jun Cho, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

To build speech processing methods that can handle speech as naturally as humans, researchers have explored multiple ways of building an invertible mapping from speech to an interpretable space.

Resynthesis

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

1 code implementation8 Feb 2023 Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness.

Code Generation Speech Synthesis +1

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

1 code implementation30 Jan 2023 Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari

While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data.

Language Modelling

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

no code implementations21 Dec 2022 Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe

The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

no code implementations20 Dec 2022 Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape.

Dialog Act Classification Question Answering +4

Context-aware Fine-tuning of Self-supervised Speech Models

no code implementations16 Dec 2022 Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu, Shinji Watanabe

During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

1 code implementation15 Dec 2022 Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan Wang, Yu-An Chung, Yun Tang, Ann Lee, Shinji Watanabe, Juan Pino

We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization.

Denoising Speech-to-Speech Translation +3

SpeechLMScore: Evaluating speech generation using speech language model

2 code implementations8 Dec 2022 Soumi Maiti, Yifan Peng, Takaaki Saeki, Shinji Watanabe

While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming.

Language Modelling Speech Enhancement +1

EURO: ESPnet Unsupervised ASR Open-source Toolkit

1 code implementation30 Nov 2022 Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola Garcia, Hung-Yi Lee, Shinji Watanabe, Sanjeev Khudanpur

This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Streaming Joint Speech Recognition and Disfluency Detection

1 code implementation16 Nov 2022 Hayato Futami, Emiru Tsunoo, Kentaro Shibata, Yosuke Kashiwagi, Takao Okuda, Siddhant Arora, Shinji Watanabe

In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner.

Language Modelling speech-recognition +1

A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units

1 code implementation12 Nov 2022 Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

To address these issues, we devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.

Voice Conversion

Align, Write, Re-order: Explainable End-to-End Speech Translation via Operation Sequence Generation

no code implementations11 Nov 2022 Motoi Omachi, Brian Yan, Siddharth Dalmia, Yuya Fujita, Shinji Watanabe

To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Bridging Speech and Textual Pre-trained Models with Unsupervised ASR

no code implementations6 Nov 2022 Jiatong Shi, Chan-Jan Hsu, Holam Chung, Dongji Gao, Paola Garcia, Shinji Watanabe, Ann Lee, Hung-Yi Lee

To be specific, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges different modalities used in speech and textual pre-trained models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

no code implementations4 Nov 2022 Yusuke Shinohara, Shinji Watanabe

In this paper, we propose a new training method to explicitly model and reduce the latency of sequence transducer models.

speech-recognition Speech Recognition

InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss

1 code implementation2 Nov 2022 Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

This paper presents InterMPL, a semi-supervised learning method of end-to-end automatic speech recognition (ASR) that performs pseudo-labeling (PL) with intermediate supervision.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder

no code implementations2 Nov 2022 Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

One crucial factor that makes this integration challenging lies in the vocabulary mismatch; the vocabulary constructed for a pre-trained LM is generally too large for E2E-ASR training and is likely to have a mismatch against a target ASR domain.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Towards Zero-Shot Code-Switched Speech Recognition

no code implementations2 Nov 2022 Brian Yan, Matthew Wiesner, Ondrej Klejch, Preethi Jyothi, Shinji Watanabe

In this work, we seek to build effective code-switched (CS) automatic speech recognition systems (ASR) under the zero-shot setting where no transcribed CS speech data is available for training.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Avoid Overthinking in Self-Supervised Models for Speech Recognition

no code implementations1 Nov 2022 Dan Berrebbi, Brian Yan, Shinji Watanabe

Although popular for classification tasks in vision and language, EE has seen less use for sequence-to-sequence speech recognition (ASR) tasks where outputs from early layers are often degenerate.

Self-Supervised Learning Sequence-To-Sequence Speech Recognition +1

Articulatory Representation Learning Via Joint Factor Analysis and Neural Matrix Factorization

no code implementations29 Oct 2022 Jiachen Lian, Alan W Black, Yijing Lu, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

In this work, we propose a novel articulatory representation decomposition algorithm that takes the advantage of guided factor analysis to derive the articulatory-specific factors and factor scores.

Representation Learning

Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models

1 code implementation27 Oct 2022 Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W Black, Shinji Watanabe

End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation.

named-entity-recognition Named Entity Recognition +2

In search of strong embedding extractors for speaker diarisation

no code implementations26 Oct 2022 Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung

First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.

Data Augmentation Speaker Verification

Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

no code implementations14 Oct 2022 Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target units.

On Compressing Sequences for Self-Supervised Speech Models

no code implementations13 Oct 2022 Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-Yi Lee, Hao Tang

Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference.

Self-Supervised Learning

CTC Alignments Improve Autoregressive Translation

no code implementations11 Oct 2022 Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization

no code implementations7 Oct 2022 Shota Horiguchi, Yuki Takashima, Shinji Watanabe, Paola Garcia

This paper focuses on speaker diarization and proposes to conduct the above bi-directional knowledge transfer alternately.

Knowledge Distillation speaker-diarization +2

E-Branchformer: Branchformer with Enhanced merging for speech recognition

1 code implementation30 Sep 2022 Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

ESPnet-ONNX: Bridging a Gap Between Research and Production

1 code implementation20 Sep 2022 Masao Someki, Yosuke Higuchi, Tomoki Hayashi, Shinji Watanabe

In the field of deep learning, researchers often focus on inventing novel neural network models and improving benchmarks.

Spoken Language Understanding

Deep Speech Synthesis from Articulatory Representations

1 code implementation13 Sep 2022 Peter Wu, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

In the articulatory synthesis task, speech is synthesized from input features containing information about the physical behavior of the human vocal tract.

Speech Synthesis

ASR2K: Speech Recognition for Around 2000 Languages without Audio

1 code implementation6 Sep 2022 Xinjian Li, Florian Metze, David R Mortensen, Alan W Black, Shinji Watanabe

We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only and improve them to 45% CER and 69% WER when using 10000 raw text utterances.

Language Modelling Speech Recognition

VQ-T: RNN Transducers using Vector-Quantized Prediction Network States

no code implementations3 Aug 2022 Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury

Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses.

Language Modelling

When Is TTS Augmentation Through a Pivot Language Useful?

1 code implementation20 Jul 2022 Nathaniel Robinson, Perez Ogayo, Swetha Gangu, David R. Mortensen, Shinji Watanabe

Developing Automatic Speech Recognition (ASR) for low-resource languages is a challenge due to the small amount of transcribed audio data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

1 code implementation19 Jul 2022 Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

To showcase such integration, we performed experiments on carefully designed synthetic datasets for noisy-reverberant multi-channel ST and SLU tasks, which can be used as benchmark corpora for future research.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Two-Pass Low Latency End-to-End Spoken Language Understanding

no code implementations14 Jul 2022 Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan Black, Shinji Watanabe

End-to-end (E2E) models are becoming increasingly popular for spoken language understanding (SLU) systems and are beginning to achieve competitive performance to pipeline-based approaches.

speech-recognition Speech Recognition +2

Online Continual Learning of End-to-End Speech Recognition Models

no code implementations11 Jul 2022 Muqiao Yang, Ian Lane, Shinji Watanabe

Continual Learning, also known as Lifelong Learning, aims to continually learn from new data as it becomes available.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Improving Speech Enhancement through Fine-Grained Speech Characteristics

1 code implementation1 Jul 2022 Muqiao Yang, Joseph Konan, David Bick, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

We first identify key acoustic parameters that have been found to correlate well with voice quality (e. g. jitter, shimmer, and spectral flux) and then propose objective functions which are aimed at reducing the difference between clean speech and enhanced speech with respect to these features.

Speech Enhancement

Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models

no code implementations1 Jul 2022 Yuki Takashima, Shota Horiguchi, Shinji Watanabe, Paola García, Yohei Kawaguchi

In this paper, we present an incremental domain adaptation technique to prevent catastrophic forgetting for an end-to-end automatic speech recognition (ASR) model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Residual Language Model for End-to-end Speech Recognition

no code implementations15 Jun 2022 Emiru Tsunoo, Yosuke Kashiwagi, Chaitanya Narisetty, Shinji Watanabe

In this paper, we propose a simple external LM fusion method for domain adaptation, which considers the internal LM estimation in its training.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

LegoNN: Building Modular Encoder-Decoder Models

no code implementations7 Jun 2022 Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji Watanabe, Florian Metze, Luke Zettlemoyer, Abdelrahman Mohamed

We describe LegoNN, a procedure for building encoder-decoder architectures in a way so that its parts can be applied to other tasks without the need for any fine-tuning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Self-Supervised Speech Representation Learning: A Review

no code implementations21 May 2022 Abdelrahman Mohamed, Hung-Yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation

no code implementations19 Apr 2022 Keqi Deng, Shinji Watanabe, Jiatong Shi, Siddhant Arora

Although Transformers have gained success in several speech processing tasks like spoken language understanding (SLU) and speech translation (ST), achieving online processing while keeping competitive performance is still essential for real-world interaction.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

no code implementations1 Apr 2022 Xuankai Chang, Takashi Maekaku, Yuya Fujita, Shinji Watanabe

This work presents our end-to-end (E2E) automatic speech recognition (ASR) model targetting at robust speech recognition, called Integraded speech Recognition with enhanced speech Input for Self-supervised learning representation (IRIS).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Better Intermediates Improve CTC Inference

no code implementations1 Apr 2022 Tatsuya Komatsu, Yusuke Fujita, Jaesong Lee, Lukas Lee, Shinji Watanabe, Yusuke Kida

This paper proposes a method for improved CTC inference with searched intermediates and multi-pass conditioning.

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

1 code implementation31 Mar 2022 Soumi Maiti, Yushi Ueda, Shinji Watanabe, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Yong Xu

In this paper, we present a novel framework that jointly performs three tasks: speaker diarization, speech separation, and speaker counting.

speaker-diarization Speaker Diarization +1

SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy

no code implementations31 Mar 2022 Shuai Guo, Jiatong Shi, Tao Qian, Shinji Watanabe, Qin Jin

Deep learning based singing voice synthesis (SVS) systems have been demonstrated to flexibly generate singing with better qualities, compared to conventional statistical parametric based methods.

Data Augmentation Singing Voice Synthesis

Acoustic Event Detection with Classifier Chains

no code implementations17 Feb 2022 Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi

In each iteration, the event's activity is estimated and used to condition the next output based on the probabilistic chain rule to form classifier chains.

Event Detection

Conditional Diffusion Probabilistic Model for Speech Enhancement

2 code implementations10 Feb 2022 Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe, Alexander Richard, Cheng Yu, Yu Tsao

Speech enhancement is a critical component of many user-oriented audio applications, yet current systems still suffer from distorted and unnatural outputs.

Speech Enhancement Speech Synthesis

Joint Speech Recognition and Audio Captioning

no code implementations3 Feb 2022 Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe

A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions.

AudioCaps Audio captioning +4

Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR

no code implementations25 Jan 2022 Emiru Tsunoo, Chaitanya Narisetty, Michael Hentschel, Yosuke Kashiwagi, Shinji Watanabe

To this end, we propose a novel blockwise synchronous decoding algorithm with a hybrid approach that combines endpoint prediction and endpoint post-determination.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem

no code implementations17 Dec 2021 Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu

Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of the speech separation/enhancement related tasks from regression to classification.

regression Speech Separation

Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization

no code implementations29 Nov 2021 Brian Yan, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe, Dong Yu

Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type.

speech-recognition Speech Recognition

ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet

2 code implementations29 Nov 2021 Siddhant Arora, Siddharth Dalmia, Pavel Denisov, Xuankai Chang, Yushi Ueda, Yifan Peng, Yuekai Zhang, Sujay Kumar, Karthik Ganesan, Brian Yan, Ngoc Thang Vu, Alan W Black, Shinji Watanabe

However, there are few open source toolkits that can be used to generate reproducible results on different Spoken Language Understanding (SLU) benchmarks.

Spoken Language Understanding

Attention-based Multi-hypothesis Fusion for Speech Summarization

2 code implementations16 Nov 2021 Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe

We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity

1 code implementation2 Nov 2021 Peter Wu, Jiatong Shi, Yifan Zhong, Shinji Watanabe, Alan W Black

We demonstrate the effectiveness of our approach in language family classification, speech recognition, and speech synthesis tasks.

Cross-Lingual Transfer speech-recognition +2

Sequence Transduction with Graph-based Supervision

no code implementations1 Nov 2021 Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

The recurrent neural network transducer (RNN-T) objective plays a major role in building today's best automatic speech recognition (ASR) systems for production.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

no code implementations27 Oct 2021 Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Yanmin Qian

The deep learning based time-domain models, e. g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement.

Speech Enhancement speech-recognition +1