Search Results for author: Shinji Watanabe

Found 354 papers, 119 papers with code

Findings of the IWSLT 2022 Evaluation Campaign

no code implementations IWSLT (ACL) 2022 Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Věra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nădejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, Shinji Watanabe

The evaluation campaign of the 19th International Conference on Spoken Language Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Speech to speech translation, (iv) Low-resource speech translation, (v) Multilingual speech translation, (vi) Dialect speech translation, (vii) Formality control for speech translation, (viii) Isometric speech translation.

Speech-to-Speech Translation

Self-supervised Representation Learning for Speech Processing

1 code implementation NAACL (ACL) 2022 Hung-Yi Lee, Abdelrahman Mohamed, Shinji Watanabe, Tara Sainath, Karen Livescu, Shang-Wen Li, Shu-wen Yang, Katrin Kirchhoff

Due to the growing popularity of SSL, and the shared mission of the speech and language areas to bring these technologies to more use cases with better quality and to scale them for under-represented languages, we propose this tutorial to systematically survey the latest SSL techniques, tools, datasets, and performance achievements in speech processing.

Representation Learning

The JHU/KyotoU Speech Translation System for IWSLT 2018

no code implementations IWSLT (EMNLP) 2018 Hirofumi Inaguma, Xuan Zhang, Zhiqi Wang, Adithya Renduchintala, Shinji Watanabe, Kevin Duh

This paper describes the Johns Hopkins University (JHU) and Kyoto University submissions to the Speech Translation evaluation campaign at IWSLT2018.

Decoder Transfer Learning +1

CMU’s IWSLT 2022 Dialect Speech Translation System

no code implementations IWSLT (ACL) 2022 Brian Yan, Patrick Fernandes, Siddharth Dalmia, Jiatong Shi, Yifan Peng, Dan Berrebbi, Xinyi Wang, Graham Neubig, Shinji Watanabe

We use additional paired Modern Standard Arabic (MSA) data to directly improve the speech recognition (ASR) and machine translation (MT) components of our cascaded systems.

Decoder Knowledge Distillation +5

Phone Inventories and Recognition for Every Language

no code implementations LREC 2022 Xinjian Li, Florian Metze, David R. Mortensen, Alan W Black, Shinji Watanabe

Identifying phone inventories is a crucial component in language documentation and the preservation of endangered languages.

Discrete Audio Tokens: More Than a Survey!

no code implementations 12 Jun 2025 Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-Yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks.

Language Modeling +2

Lessons Learned from the URGENT 2024 Speech Enhancement Challenge

1 code implementation 2 Jun 2025 Wangyou Zhang, Kohei Saijo, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Wei Wang, Yihui Fu, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

The URGENT 2024 Challenge aims to foster speech enhancement (SE) techniques with great universality, robustness, and generalizability, featuring a broader task definition, large-scale multi-domain data, and comprehensive evaluation metrics.

Speech Enhancement

On-device Streaming Discrete Speech Units

no code implementations 2 Jun 2025 Kwanghee Choi, Masao Someki, Emma Strubell, Shinji Watanabe

Discrete speech units (DSUs) are derived from clustering the features of self-supervised speech models (S3Ms).
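
As a rough illustration of how such units can be obtained, the sketch below clusters frame-level features with k-means and maps each frame to its cluster index. This is a minimal sketch: the feature matrix is a random placeholder standing in for real S3M outputs, and the codebook size is arbitrary.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    features = rng.standard_normal((1000, 768))  # placeholder for S3M frame features (T x D)

    # Fit a codebook, then map every frame to its nearest centroid index.
    kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
    dsu_sequence = kmeans.predict(features)      # one discrete unit per frame
    print(dsu_sequence[:20])

In practice the features would come from a pretrained model such as HuBERT or wav2vec 2.0, and the number of clusters is a design choice rather than anything fixed by the paper.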

Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

no code implementations 31 May 2025 Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions.

Language Modeling +7

OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning

no code implementations 31 May 2025 Yifan Peng, Shakeel Muhammad, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe

To address this, we develop a scalable data-cleaning pipeline using public toolkits, yielding a dataset with 166,000 hours of speech across 75 languages.

Explainable Depression Detection using Masked Hard Instance Mining

no code implementations 30 May 2025 Patawee Prakrankamanant, Shinji Watanabe, Ekapol Chuangsuwanich

This paper addresses the critical need for improved explainability in text-based depression detection.

Depression Detection

Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC

no code implementations 30 May 2025 Qingzheng Wang, Jiancheng Sun, Yifan Peng, Shinji Watanabe

Multilingual speech processing with self-supervised or supervised pre-trained Speech Foundation Models (SFM) has achieved strong performance on tasks like Language Identification (LID) and Automatic Speech Recognition (ASR).

Automatic Speech Recognition (ASR) +3

Interspeech 2025 URGENT Speech Enhancement Challenge

no code implementations 29 May 2025 Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Yihui Fu, Wei Wang, Tim Fingscheidt, Shinji Watanabe

There has been a growing effort to develop universal speech enhancement (SE) to handle inputs with various speech distortions and recording conditions.

Diversity Speech Enhancement

Context-Driven Dynamic Pruning for Large Speech Foundation Models

no code implementations 24 May 2025 Masao Someki, Shikhar Bharadwaj, Atharva Anand Joshi, Chyi-Jiunn Lin, Jinchuan Tian, Jee-weon Jung, Markus Müller, Nathan Susanj, Jing Liu, Shinji Watanabe

Speech foundation models achieve strong generalization across languages and acoustic conditions, but require significant computational resources for inference.

On The Landscape of Spoken Language Models: A Comprehensive Survey

no code implementations 11 Apr 2025 Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs) which act as universal speech processing systems.

Survey

Summarizing Speech: A Comprehensive Survey

no code implementations 10 Apr 2025 Fabian Retkowski, Maike Züfle, Andreas Sudmann, Dinah Pfau, Shinji Watanabe, Jan Niehues, Alexander Waibel

Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content.

Meeting Summarization Speech Recognition +3

Aligning Text-to-Music Evaluation with Human Preferences

1 code implementation 20 Mar 2025 Yichen Huang, Zachary Novack, Koichi Saito, Jiatong Shi, Shinji Watanabe, Yuki Mitsufuji, John Thickstun, Chris Donahue

In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to particular musical desiderata, and (2) collecting and evaluating on MusicPrefs, the first open-source dataset of human preferences for TTM systems.

FAD
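
For readers unfamiliar with reference-based divergence metrics of this kind, the sketch below computes a Fréchet distance between the Gaussian statistics of two embedding sets, the quantity underlying FAD-style metrics. The embeddings are random placeholders, not outputs of any real audio encoder, and this is only a generic sketch of the metric family the paper studies.

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(x, y):
        mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
        cov_x = np.cov(x, rowvar=False)
        cov_y = np.cov(y, rowvar=False)
        covmean = sqrtm(cov_x @ cov_y)
        if np.iscomplexobj(covmean):        # numerical noise can yield tiny imaginary parts
            covmean = covmean.real
        diff = mu_x - mu_y
        return diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean)

    rng = np.random.default_rng(0)
    ref = rng.standard_normal((500, 128))   # reference-set embeddings (placeholder)
    gen = rng.standard_normal((500, 128)) + 0.1  # "generated" embeddings (placeholder)
    print(frechet_distance(ref, gen))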

ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems

no code implementations 11 Mar 2025 Siddhant Arora, Yifan Peng, Jiatong Shi, Jinchuan Tian, William Chen, Shikhar Bharadwaj, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shuichiro Shimizu, Vaibhav Srivastav, Shinji Watanabe

Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems.

Diversity Spoken Dialogue Systems

Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment

1 code implementation 10 Feb 2025 Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, David Mortensen

Allophony refers to the variation in the phonetic realization of a phoneme based on its phonetic environment.

Discrete Speech Unit Extraction via Independent Component Analysis

1 code implementation 11 Jan 2025 Tomohiko Nakamura, Kwanghee Choi, Keigo Hojo, Yoshiaki Bando, Satoru Fukayama, Shinji Watanabe

Self-supervised speech models (S3Ms) have become a common tool for the speech processing community, leveraging representations for downstream tasks.

Automatic Speech Recognition (ASR) +2
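
A minimal sketch of the idea, assuming random placeholder features in place of real S3M outputs: decorrelate the features with FastICA, then cluster the independent components into discrete units. The component and cluster counts below are arbitrary, not the paper's settings.

    import numpy as np
    from sklearn.decomposition import FastICA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    feats = rng.standard_normal((2000, 512))          # T x D frame features (placeholder)

    ica = FastICA(n_components=64, random_state=0)
    independent = ica.fit_transform(feats)            # statistically independent components
    units = KMeans(n_clusters=200, n_init=10, random_state=0).fit_predict(independent)
    print(units[:20])                                 # discrete speech units per frame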

Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization

1 code implementation 26 Dec 2024 Yihan Wu, Yichen Lu, Yifan Peng, Xihua Wang, Ruihua Song, Shinji Watanabe

Audiovisual Automatic Speech Recognition (AV-ASR) aims to improve speech recognition accuracy by leveraging visual signals.

Automatic Speech Recognition +1

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

1 code implementation 7 Dec 2024 Pengcheng Guo, Xuankai Chang, Hang Lv, Shinji Watanabe, Lei Xie

With data augmentation, we establish new state-of-the-art WERs of 14.6% on the Libri2Mix test set and 4.4% on the WSJ0-2Mix test set.

Automatic Speech Recognition Data Augmentation +3

Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

1 code implementation 8 Nov 2024 Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Chih-Kai Yang, Wenze Ren, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang, Fabian Ritter-Gutierrez, Kuan-Po Huang, Siddhant Arora, You-Kuan Lin, Ming To Chuang, Eunjung Yeo, Kalvin Chang, Chung-Ming Chien, Kwanghee Choi, Jun-You Wang, Cheng-Hsiu Hsieh, Yi-Cheng Lin, Chee-En Yu, I-Hsiang Chiu, Heitor R. Guimarães, Jionghao Han, Tzu-Quan Lin, Tzu-Yuan Lin, Homu Chang, Ting-Wu Chang, Chun Wei Chen, Shou-Jen Chen, Yu-Hua Chen, Hsi-Chun Cheng, Kunal Dhawan, Jia-Lin Fang, Shi-Xin Fang, Kuan-Yu Fang Chiang, Chi An Fu, Hsien-Fu Hsiao, Ching Yu Hsu, Shao-Syuan Huang, Lee Chen Wei, Hsi-Che Lin, Hsuan-Hao Lin, Hsuan-Ting Lin, Jian-Ren Lin, Ting-Chun Liu, Li-Chun Lu, Tsung-Min Pai, Ankita Pasad, Shih-Yun Shan Kuan, Suwon Shon, Yuxun Tang, Yun-Shao Tsai, Jui-Chiang Wei, Tzu-Chieh Wei, Chengxi Wu, Dien-Ruei Wu, Chao-Han Huck Yang, Chieh-Chi Yang, Jia Qi Yip, Shao-Xiang Yuan, Vahid Noroozi, Zhehuai Chen, Haibin Wu, Karen Livescu, David Harwath, Shinji Watanabe, Hung-Yi Lee

We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models.

Emotion Recognition

FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model

1 code implementation 3 Oct 2024 Yichen Lu, Jiaqi Song, Chao-Han Huck Yang, Shinji Watanabe

In this study, we aim to explore Multitask Speech Language Model (SpeechLM) efficient inference via token reduction.

Emotion Recognition Language Modeling +3

End-to-End Speech Recognition with Pre-trained Masked Language Model

1 code implementation 1 Oct 2024 Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

We present a novel approach to end-to-end automatic speech recognition (ASR) that utilizes pre-trained masked language models (LMs) to facilitate the extraction of linguistic information.

Automatic Speech Recognition (ASR) +4

Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking

1 code implementation 27 Sep 2024 Brian Yan, Vineel Pratap, Shinji Watanabe, Michael Auli

Multilingual Automatic Speech Recognition (ASR) models are typically evaluated in a setting where the ground-truth language of the speech utterance is known; however, this is often not the case in practical settings.

Automatic Speech Recognition (ASR) +4
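
The re-ranking idea is straightforward to illustrate. Below is a toy sketch in which each N-best hypothesis carries an ASR log-score plus external language-identification and language-model log-scores, and the final pick maximizes a weighted sum; all hypotheses, scores, and weights are invented for illustration and are not the paper's values.

    # Toy N-best re-ranking: combine the ASR score with external LID and LM scores.
    def rerank(nbest, lid_weight=0.5, lm_weight=0.3):
        def combined(h):
            return h["asr_score"] + lid_weight * h["lid_score"] + lm_weight * h["lm_score"]
        return max(nbest, key=combined)

    nbest = [
        {"text": "hola mundo", "asr_score": -4.1, "lid_score": -0.2, "lm_score": -3.0},
        {"text": "hola mondo", "asr_score": -3.9, "lid_score": -1.5, "lm_score": -4.2},
    ]
    print(rerank(nbest)["text"])   # the LID/LM evidence rescues the correct hypothesis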

Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens

no code implementations 24 Sep 2024 Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe

In many real-world scenarios, such as meetings, multiple speakers are present with an unknown number of participants, and their utterances often overlap.

Clustering Decoder +2

Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models

1 code implementation 21 Sep 2024 Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, KaiWei Chang, Jiawei Du, Ke-Han Lu, Alexander H. Liu, Ho-Lam Chung, Yuan-Kuei Wu, Dongchao Yang, Songxiang Liu, Yi-Chiao Wu, Xu Tan, James Glass, Shinji Watanabe, Hung-Yi Lee

Neural audio codec models are becoming increasingly important as they serve as tokenizers for audio, enabling efficient transmission or facilitating speech language modeling.

Language Modeling

Preference Alignment Improves Language Model-Based TTS

no code implementations 19 Sep 2024 Jinchuan Tian, Chunlei Zhang, Jiatong Shi, Hao Zhang, Jianwei Yu, Shinji Watanabe, Dong Yu

Recent advancements in text-to-speech (TTS) have shown that language model (LM)-based systems offer competitive performance to their counterparts.

Language Modeling +3

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts

no code implementations 19 Sep 2024 Yihan Wu, Yifan Peng, Yichen Lu, Xuankai Chang, Ruihua Song, Shinji Watanabe

Moreover, to incorporate visual information effectively, we inject visual information into the ASR model through a mixture-of-experts module.

Mixture-of-Experts Robust Speech Recognition +1

SpoofCeleb: Speech Deepfake Detection and SASV In The Wild

no code implementations 18 Sep 2024 Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe

This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data.

DeepFake Detection Diversity +4

Task Arithmetic for Language Expansion in Speech Translation

no code implementations 17 Sep 2024 Yao-Fei Cheng, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Wen Shen Teo, Siddhant Arora, Shinji Watanabe

However, expanding language pairs from an existing instruction-tuned ST system is costly due to the necessity of re-training on a combination of new and previous datasets.

Machine Translation Task Arithmetic +1

Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

no code implementations 16 Sep 2024 Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald

To this end, we show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for the unsupervised learning of speaker representations.

Speaker Recognition Speaker Verification

ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

no code implementations 14 Sep 2024 Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, Shinji Watanabe

We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models.

Text-To-Speech Synthesis In The Wild

no code implementations 13 Sep 2024 Jee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe

Traditional Text-to-Speech (TTS) systems rely on studio-quality speech recorded in controlled settings. Recently, an effort known as noisy-TTS training has emerged, aiming to utilize in-the-wild data.

Benchmarking Speaker Recognition +4

Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition

1 code implementation 17 Aug 2024 Samuele Cornell, Jordan Darefsky, Zhiyao Duan, Shinji Watanabe

In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis.

Language Modeling +7
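
At a high level, such a pipeline chains two generators: an LLM drafts a multi-speaker transcript, and a speaker-conditioned TTS renders each turn. The sketch below shows that flow with hypothetical llm and tts callables, replaced here by dummies so it runs end to end; none of these functions are the paper's actual APIs.

    def generate_conversation(llm, topic):
        # Hypothetical: ask an LLM for (speaker, utterance) turns about a topic.
        prompt = "Write a short two-person conversation about " + topic
        return llm(prompt)

    def synthesize(tts, turns):
        # Hypothetical: render each turn with a speaker-conditioned TTS voice.
        return [(spk, tts(text, voice=spk)) for spk, text in turns]

    # Dummy stand-ins so the sketch executes:
    dummy_llm = lambda prompt: [("A", "Hi, did you catch the game?"), ("B", "I did!")]
    dummy_tts = lambda text, voice: b"\x00" * 160   # placeholder "audio" bytes

    turns = generate_conversation(dummy_llm, "sports")
    audio = synthesize(dummy_tts, turns)
    print(len(audio), "synthetic utterances")       # paired with transcripts for ASR training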

CMU's IWSLT 2024 Simultaneous Speech Translation System

no code implementations 14 Aug 2024 Xi Xu, Siqi Ouyang, Brian Yan, Patrick Fernandes, William Chen, Lei Li, Graham Neubig, Shinji Watanabe

This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech to German text in a streaming manner.

Decoder Speech-to-Text +1

SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data

1 code implementation 1 Aug 2024 Yichen Lu, Jiaqi Song, Xuankai Chang, Hengwei Bian, Soumi Maiti, Shinji Watanabe

In this work, we present SynesLM, a unified model which can perform three multimodal language understanding tasks: audio-visual automatic speech recognition (AV-ASR) and visual-aided speech/machine translation (VST/VMT).

Audio-Visual Speech Recognition Automatic Speech Recognition +5

Multi-Convformer: Extending Conformer with Multiple Convolution Kernels

1 code implementation 4 Jul 2024 Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe

Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition~(ASR) systems due to their efficient modelling of local context.

Automatic Speech Recognition (ASR) +1

Towards Robust Speech Representation Learning for Thousands of Languages

no code implementations 30 Jun 2024 William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe

We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold.

Representation Learning Self-Supervised Learning +1

Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing

no code implementations 25 Jun 2024 Hye-jin Shim, Md Sahidullah, Jee-weon Jung, Shinji Watanabe, Tomi Kinnunen

Our investigations highlight the significant differences in training dynamics between the two classes, emphasizing the need for future research to focus on robust modeling of the bonafide class.

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

no code implementations 23 Jun 2024 Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

Our method outperforms a conventional contextual biasing baseline on the LibriSpeech corpus, achieving a relative improvement of 22.5% in biased word error rate (B-WER) and up to 44% compared to the non-contextual baseline with a biasing list size of 100.

Automatic Speech Recognition +1

Diffusion-based Generative Modeling with Discriminative Guidance for Streamable Speech Enhancement

no code implementations 19 Jun 2024 Chenda Li, Samuele Cornell, Shinji Watanabe, Yanmin Qian

In this paper, we propose to use discriminative scores from discriminative models in the first steps of the reverse diffusion process (RDP).

Speech Enhancement

Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

no code implementations 18 Jun 2024 Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe

End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech.

Decoder Language Identification +2

On the Evaluation of Speech Foundation Models for Spoken Language Understanding

no code implementations 14 Jun 2024 Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head.

Benchmarking Prediction +3
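
Protocol (i) above, a frozen SFM with a lightweight prediction head, can be sketched in a few lines. Here the encoder is a small GRU standing in for a real foundation model, and the feature dimensions and class count are arbitrary placeholders.

    import torch
    import torch.nn as nn

    encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)  # placeholder SFM
    for p in encoder.parameters():
        p.requires_grad = False            # freeze the foundation model

    head = nn.Linear(256, 10)              # lightweight prediction head (10 classes)

    feats = torch.randn(4, 120, 80)        # batch of 4 utterances, 120 frames, 80-dim
    with torch.no_grad():
        hidden, _ = encoder(feats)         # frozen forward pass
    logits = head(hidden.mean(dim=1))      # mean-pool over time, then classify
    print(logits.shape)                    # torch.Size([4, 10])

Only the head's parameters receive gradients during training, which is what makes the probe "lightweight" relative to full fine-tuning.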

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models

no code implementations 13 Jun 2024 Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe

The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models.

Language Modeling +2

Neural Blind Source Separation and Diarization for Distant Speech Recognition

no code implementations 12 Jun 2024 Yoshiaki Bando, Tomohiko Nakamura, Shinji Watanabe

This paper presents a neural method for distant speech recognition (DSR) that jointly separates and diarizes speech mixtures without supervision by isolated signals.

Blind Source Separation Distant Speech Recognition +3

ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

no code implementations 12 Jun 2024 Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-Yi Lee, Shinji Watanabe

This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches.

Automatic Speech Recognition (ASR) +4

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation

2 code implementations 10 Jun 2024 Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann

We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.

Speech Enhancement

To what extent can ASV systems naturally defend against spoofing attacks?

no code implementations 8 Jun 2024 Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, Joon Son Chung

The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target.

Speaker Verification

Joint Beam Search Integrating CTC, Attention, and Transducer Decoders

no code implementations 5 Jun 2024 Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

In addition, we propose three novel joint beam search algorithms by combining three decoders (CTC, RNN-T, and attention) to further improve performance.

Automatic Speech Recognition Decoder +2
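
A toy sketch of the joint scoring step: each beam hypothesis carries log-probabilities from the CTC, attention, and transducer decoders, and hypotheses are ranked by a weighted sum. The weights and scores below are invented for illustration and do not reflect the paper's algorithms or settings.

    # Rank beam hypotheses by a weighted combination of three decoder scores.
    def joint_score(h, w_ctc=0.3, w_att=0.5, w_rnnt=0.2):
        return w_ctc * h["ctc"] + w_att * h["att"] + w_rnnt * h["rnnt"]

    beam = [
        {"tokens": "the cat", "ctc": -2.1, "att": -1.8, "rnnt": -2.0},
        {"tokens": "a cat",   "ctc": -2.6, "att": -1.7, "rnnt": -2.4},
    ]
    beam.sort(key=joint_score, reverse=True)   # keep best-scoring hypotheses first
    print(beam[0]["tokens"])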

YODAS: Youtube-Oriented Dataset for Audio and Speech

no code implementations 2 Jun 2024 Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe

In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset comprising currently over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets.

Self-Supervised Learning Speech Recognition +1

Cross-Talk Reduction

no code implementations 30 May 2024 Zhong-Qiu Wang, Anurag Kumar, Shinji Watanabe

While far-field multi-talker mixtures are recorded, each speaker can wear a close-talk microphone so that close-talk mixtures can be recorded at the same time.

Speech Separation

Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System

1 code implementation 17 May 2024 Vimal Manohar, Szu-Jui Chen, Zhiqi Wang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

This paper summarizes our acoustic modeling efforts in the Johns Hopkins University speech recognition system for the CHiME-5 challenge to recognize highly-overlapped dinner party speech recorded by multiple microphone arrays.

Data Augmentation Speech Dereverberation +2

LV-CTC: Non-autoregressive ASR with CTC and latent variable models

no code implementations 28 Mar 2024 Yuya Fujita, Shinji Watanabe, Xuankai Chang, Takashi Maekaku

In this paper, we propose a new model combining CTC and a latent variable model, which is one of the state-of-the-art models in the neural machine translation research field.

Automatic Speech Recognition (ASR) +2

Wav2Gloss: Generating Interlinear Glossed Text from Speech

1 code implementation 19 Mar 2024 Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel R. Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R. Mortensen, Lori Levin

Thousands of the world's languages are in danger of extinction, a tremendous threat to cultural identities and human language diversity.

Diversity

Evaluating and Improving Continual Learning in Spoken Language Understanding

no code implementations 16 Feb 2024 Muqiao Yang, Xiang Li, Umberto Cappellazzo, Shinji Watanabe, Bhiksha Raj

In this work, we propose an evaluation methodology that provides a unified evaluation on stability, plasticity, and generalizability in continual learning.

Continual Learning Spoken Language Understanding

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition

no code implementations 31 Jan 2024 Yihan Wu, Soumi Maiti, Yifan Peng, Wangyou Zhang, Chenda Li, Yuyue Wang, Xihua Wang, Shinji Watanabe, Ruihua Song

Existing speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model.

Decoder Language Modeling +6

Improving Design of Input Condition Invariant Speech Enhancement

1 code implementation 25 Jan 2024 Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian

In this paper, we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while degradation in real conditions is much mitigated.

Speech Enhancement

Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

no code implementations 19 Jan 2024 Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, Shinji Watanabe

The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data.

Automatic Speech Recognition (ASR) +1

Improving ASR Contextual Biasing with Guided Attention

no code implementations 16 Jan 2024 Jiyang Tang, Kwangyoun Kim, Suwon Shon, Felix Wu, Prashant Sridhar, Shinji Watanabe

Compared to studies with similar motivations, the proposed loss operates directly on the cross attention weights and is easier to implement.

Automatic Speech Recognition (ASR) +1

AugSumm: towards generalizable speech summarization using synthetic labels from large language model

1 code implementation 10 Jan 2024 Jee-weon Jung, Roshan Sharma, William Chen, Bhiksha Raj, Shinji Watanabe

We tackle this challenge by proposing AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries for training and evaluation.

Language Modeling +2

Understanding Probe Behaviors through Variational Bounds of Mutual Information

1 code implementation 15 Dec 2023 Kwanghee Choi, Jee-weon Jung, Shinji Watanabe

With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation.

Phoneme-aware Encoding for Prefix-tree-based Contextual ASR

no code implementations 15 Dec 2023 Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Hiroaki Ogawa, Siddhant Arora, Shinji Watanabe

While the original TCPGen relies on grapheme-based encoding, we propose extending it with phoneme-aware encoding to better recognize words of unusual pronunciations.

Speech Recognition

Generative Context-aware Fine-tuning of Self-supervised Speech Models

no code implementations 15 Dec 2023 Suwon Shon, Kwangyoun Kim, Prashant Sridhar, Yi-Te Hsu, Shinji Watanabe, Karen Livescu

Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text.

Automatic Speech Recognition Named Entity Recognition +6

A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction

no code implementations 12 Oct 2023 Kohei Saijo, Wangyou Zhang, Zhong-Qiu Wang, Shinji Watanabe, Tetsunori Kobayashi, Tetsuji Ogawa

We propose a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement (SE) tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting.

Denoising Speech Enhancement +2

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

no code implementations 9 Oct 2023 Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification.

Language Identification Speech Recognition +1

UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

no code implementations 4 Oct 2023 Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe

Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing the performance of task-specific models.

 Ranked #1 on Spoken Language Understanding on Fluent Speech Commands (using extra training data)

Automatic Speech Recognition (ASR) +3

One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition

no code implementations 2 Oct 2023 Samuele Cornell, Jee-weon Jung, Shinji Watanabe, Stefano Squartini

This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition).

Automatic Speech Recognition +5

Toward Universal Speech Enhancement for Diverse Input Conditions

no code implementations 29 Sep 2023 Wangyou Zhang, Kohei Saijo, Zhong-Qiu Wang, Shinji Watanabe, Yanmin Qian

Currently, there is no universal SE approach that can effectively handle diverse input conditions with a single model.

Denoising Speech Enhancement

Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

no code implementations 27 Sep 2023 Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe, Sanjeev Khudanpur

Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied.

Machine Translation

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

no code implementations 27 Sep 2023 Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe

Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation.

Decoder Machine Translation +4

Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

no code implementations 26 Sep 2023 Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe

Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy.

Automatic Speech Recognition (ASR) +2

Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

no code implementations 26 Sep 2023 William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe

We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials.

Denoising Self-Supervised Learning

Semi-Autoregressive Streaming ASR With Label Context

no code implementations 19 Sep 2023 Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury

Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy.

Automatic Speech Recognition (ASR) +2

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

1 code implementation 18 Sep 2023 Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-Yi Lee

To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark.

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

no code implementations 16 Sep 2023 Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Because the decoder architecture is the same as an autoregressive LM, it is simple to enhance the model by leveraging external text data with LM training.

Automatic Speech Recognition (ASR) +4

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

no code implementations 15 Sep 2023 Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

To this end, we start by importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp.

Image Comprehension Language Modeling +2

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

no code implementations 15 Sep 2023 Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang, Hongbo Lan, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao

This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments.

Audio-Visual Speech Recognition Speech Recognition +2

Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

1 code implementation 15 Sep 2023 Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention.

Language Identification Speech Recognition +1

Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

no code implementations 14 Sep 2023 Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe

We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation.

Decoder Language Modeling +5

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

1 code implementation 19 Aug 2023 Jinchuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu, Shinji Watanabe

While the vanilla transducer does not have a prior preference for any of the valid paths, this work intends to enforce the preferred paths and achieve controllable alignment prediction.

Automatic Speech Recognition (ASR) +3

Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition

no code implementations 24 Jul 2023 Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Although frame-based models, such as CTC and transducers, have an affinity for streaming automatic speech recognition, their decoding uses no future knowledge, which could lead to incorrect pruning.

Automatic Speech Recognition Decoder +2

Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding

no code implementations 20 Jul 2023 Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

There has been an increased interest in the integration of pretrained speech recognition (ASR) and language models (LM) into the SLU framework.

Speech Recognition +1

BASS: Block-wise Adaptation for Speech Summarization

no code implementations 17 Jul 2023 Roshan Sharma, Kenneth Zheng, Siddhant Arora, Shinji Watanabe, Rita Singh, Bhiksha Raj

End-to-end speech summarization has been shown to improve performance over cascade baselines.

Deep Speech Synthesis from MRI-Based Articulatory Representations

1 code implementation 5 Jul 2023 Peter Wu, Tingle Li, Yijing Lu, Yubin Zhang, Jiachen Lian, Alan W Black, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

Finally, through a series of ablations, we show that the proposed MRI representation is more comprehensive than EMA and identify the most suitable MRI feature subset for articulatory synthesis.

Computational Efficiency Denoising +1

A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning

2 code implementations 19 May 2023 Jiyang Tang, William Chen, Xuankai Chang, Shinji Watanabe, Brian MacWhinney

Our system achieves state-of-the-art speaker-level detection accuracy (97.3%), and a relative WER reduction of 11% for moderate Aphasia patients.

Multi-Task Learning Speech Recognition +1

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

1 code implementation 18 May 2023 Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath

We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.

Audio-Visual Speech Recognition Prompt Engineering +2

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

no code implementations 18 May 2023 Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.

Automatic Speech Recognition Language Identification +3

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

2 code implementations 18 May 2023 Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe

Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance in various tasks, including automatic speech recognition (ASR), speech translation (ST) and spoken language understanding (SLU).

Automatic Speech Recognition (ASR) +2

Joint Modelling of Spoken Language Understanding Tasks with Integrated Dialog History

no code implementations 1 May 2023 Siddhant Arora, Hayato Futami, Emiru Tsunoo, Brian Yan, Shinji Watanabe

Most human interactions occur in the form of spoken conversations where the semantic meaning of a given utterance depends on the context.

Spoken Language Understanding

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

1 code implementation 25 Apr 2023 Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue.

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling

no code implementations 18 Apr 2023 Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

We propose FSB-LSTM, a novel long short-term memory (LSTM) based architecture that integrates full- and sub-band (FSB) modeling, for single- and multi-channel speech enhancement in the short-time Fourier transform (STFT) domain.

Speech Enhancement
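
To make the sub-band idea concrete, the sketch below computes an STFT and slices the frequency axis into equal bands, the kind of decomposition a full- and sub-band model operates on. The signal, FFT size, and band count are arbitrary placeholders, and the last remainder bin is dropped for simplicity; this illustrates the decomposition only, not the FSB-LSTM architecture itself.

    import torch

    wav = torch.randn(1, 16000)                      # 1 s of audio at 16 kHz (random)
    spec = torch.stft(wav, n_fft=512, hop_length=128,
                      window=torch.hann_window(512), return_complex=True)
    # spec: (batch, freq_bins=257, frames)

    n_bands = 4
    bins_per_band = spec.shape[1] // n_bands
    subbands = [spec[:, i * bins_per_band:(i + 1) * bins_per_band]
                for i in range(n_bands)]
    for i, sb in enumerate(subbands):
        print(f"band {i}: {tuple(sb.shape)}")        # each band can be modeled separately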

Enhancing Speech-to-Speech Translation with Multiple TTS Targets

no code implementations 10 Apr 2023 Jiatong Shi, Yun Tang, Ann Lee, Hirofumi Inaguma, Changhan Wang, Juan Pino, Shinji Watanabe

It has been known that direct speech-to-speech translation (S2ST) models usually suffer from the data scarcity issue because of the limited existing parallel materials for both source and target speech.

Speech-to-Speech Translation Speech-to-Text +4

End-to-End Speech Recognition: A Survey

no code implementations 3 Mar 2023 Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe

In the last decade of automatic speech recognition (ASR) research, the introduction of deep learning brought considerable reductions in word error rate of more than 50% relative, compared to modeling without deep learning.

Automatic Speech Recognition (ASR) +5

Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding

1 code implementation 27 Feb 2023 Yifan Peng, Kwangyoun Kim, Felix Wu, Prashant Sridhar, Shinji Watanabe

Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow.

Model Compression Representation Learning +3

Improving Massively Multilingual ASR With Auxiliary CTC Objectives

1 code implementation 24 Feb 2023 William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, Shinji Watanabe

In this paper, we introduce our work on improving performance on FLEURS, a 102-language open ASR benchmark, by conditioning the entire model on language identity (LID).

Automatic Speech Recognition (ASR) +1

PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement

2 code implementations 16 Feb 2023 Muqiao Yang, Joseph Konan, David Bick, Yunyang Zeng, Shuo Han, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

We can add this criterion as an auxiliary loss to any model that produces speech, to optimize speech outputs to match the values of clean speech in these features.

Speech Enhancement Time Series +1

Multi-Channel Target Speaker Extraction with Refinement: The WavLab Submission to the Second Clarity Enhancement Challenge

no code implementations 15 Feb 2023 Samuele Cornell, Zhong-Qiu Wang, Yoshiki Masuyama, Shinji Watanabe, Manuel Pariente, Nobutaka Ono

To address the challenges encountered in the CEC2 setting, we introduce four major novelties: (1) we extend the state-of-the-art TF-GridNet model, originally designed for monaural speaker separation, for multi-channel, causal speech enhancement, and large improvements are observed by replacing the TCNDenseNet used in iNeuBe with this new architecture; (2) we leverage a recent dual window size approach with future-frame prediction to ensure that iNeuBe-X satisfies the 5 ms constraint on algorithmic latency required by CEC2; (3) we introduce a novel speaker-conditioning branch for TF-GridNet to achieve target speaker extraction; (4) we propose a fine-tuning step, where we compute an additional loss with respect to the target speaker signal compensated with the listener audiogram.

Speaker Separation Speech Enhancement +1

Speaker-Independent Acoustic-to-Articulatory Speech Inversion

1 code implementation 14 Feb 2023 Peter Wu, Li-Wei Chen, Cheol Jun Cho, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

To build speech processing methods that can handle speech as naturally as humans, researchers have explored multiple ways of building an invertible mapping from speech to an interpretable space.

Resynthesis

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

1 code implementation 8 Feb 2023 Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness.

Code Generation Diversity +4

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

1 code implementation 30 Jan 2023 Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari

While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data.

Language Modeling +2

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

no code implementations 21 Dec 2022 Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe

The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models.

Automatic Speech Recognition (ASR) +2

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

no code implementations 20 Dec 2022 Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape.

Dialog Act Classification Question Answering +4

Context-aware Fine-tuning of Self-supervised Speech Models

no code implementations 16 Dec 2022 Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu, Shinji Watanabe

During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments.

Automatic Speech Recognition (ASR) +5
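
One way to picture such an auxiliary loss, assuming random placeholder embeddings: penalize the cosine distance between the current segment's context vector and the vectors of surrounding segments. This is only a sketch of the general idea, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    current = torch.randn(1, 256, requires_grad=True)   # context vector of this segment
    neighbors = torch.randn(4, 256)                     # vectors of surrounding segments

    cos = F.cosine_similarity(current, neighbors)       # similarity to each neighbor
    aux_loss = (1.0 - cos).mean()                       # pull current toward its neighbors
    aux_loss.backward()
    print(float(aux_loss))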

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

1 code implementation 15 Dec 2022 Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan Wang, Yu-An Chung, Yun Tang, Ann Lee, Shinji Watanabe, Juan Pino

We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization.

Decoder Denoising +4

SpeechLMScore: Evaluating speech generation using speech language model

2 code implementations 8 Dec 2022 Soumi Maiti, Yifan Peng, Takaaki Saeki, Shinji Watanabe

While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming.

Language Modeling +5

EURO: ESPnet Unsupervised ASR Open-source Toolkit

1 code implementation 30 Nov 2022 Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola Garcia, Hung-Yi Lee, Shinji Watanabe, Sanjeev Khudanpur

This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR).

Automatic Speech Recognition (ASR) +1

Streaming Joint Speech Recognition and Disfluency Detection

1 code implementation 16 Nov 2022 Hayato Futami, Emiru Tsunoo, Kentaro Shibata, Yosuke Kashiwagi, Takao Okuda, Siddhant Arora, Shinji Watanabe

In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner.

Decoder Language Modelling +2

A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units

1 code implementation 12 Nov 2022 Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

To address these issues, we devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.

Rhythm Voice Conversion

Align, Write, Re-order: Explainable End-to-End Speech Translation via Operation Sequence Generation

no code implementations 11 Nov 2022 Motoi Omachi, Brian Yan, Siddharth Dalmia, Yuya Fujita, Shinji Watanabe

To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word.

Automatic Speech Recognition (ASR) +2

Bridging Speech and Textual Pre-trained Models with Unsupervised ASR

no code implementations 6 Nov 2022 Jiatong Shi, Chan-Jan Hsu, Holam Chung, Dongji Gao, Paola Garcia, Shinji Watanabe, Ann Lee, Hung-Yi Lee

To be specific, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges different modalities used in speech and textual pre-trained models.

Automatic Speech Recognition (ASR) +3

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

no code implementations 4 Nov 2022 Yusuke Shinohara, Shinji Watanabe

In this paper, we propose a new training method to explicitly model and reduce the latency of sequence transducer models.

Speech Recognition

Towards Zero-Shot Code-Switched Speech Recognition

no code implementations 2 Nov 2022 Brian Yan, Matthew Wiesner, Ondrej Klejch, Preethi Jyothi, Shinji Watanabe

In this work, we seek to build effective code-switched (CS) automatic speech recognition systems (ASR) under the zero-shot setting where no transcribed CS speech data is available for training.

Automatic Speech Recognition (ASR) +4

BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder

no code implementations 2 Nov 2022 Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

One crucial factor that makes this integration challenging lies in the vocabulary mismatch; the vocabulary constructed for a pre-trained LM is generally too large for E2E-ASR training and is likely to have a mismatch against a target ASR domain.

Automatic Speech Recognition (ASR) +3

InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss

1 code implementation 2 Nov 2022 Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

This paper presents InterMPL, a semi-supervised learning method of end-to-end automatic speech recognition (ASR) that performs pseudo-labeling (PL) with intermediate supervision.

Automatic Speech Recognition (ASR) +2

Avoid Overthinking in Self-Supervised Models for Speech Recognition

no code implementations 1 Nov 2022 Dan Berrebbi, Brian Yan, Shinji Watanabe

Although popular for classification tasks in vision and language, early exit (EE) has seen less use for sequence-to-sequence speech recognition (ASR) tasks where outputs from early layers are often degenerate.

Self-Supervised Learning Sequence-To-Sequence Speech Recognition +1

Articulatory Representation Learning Via Joint Factor Analysis and Neural Matrix Factorization

no code implementations 29 Oct 2022 Jiachen Lian, Alan W Black, Yijing Lu, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

In this work, we propose a novel articulatory representation decomposition algorithm that takes advantage of guided factor analysis to derive the articulatory-specific factors and factor scores.

Representation Learning

Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models

1 code implementation 27 Oct 2022 Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W Black, Shinji Watanabe

End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation.

Named Entity Recognition +2

In search of strong embedding extractors for speaker diarisation

no code implementations 26 Oct 2022 Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung

First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.

Data Augmentation Speaker Verification

Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

no code implementations 14 Oct 2022 Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target units.

On Compressing Sequences for Self-Supervised Speech Models

no code implementations 13 Oct 2022 Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-Yi Lee, Hao Tang

Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference.

Self-Supervised Learning
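
The subsampling itself can be as simple as average-pooling features along time. The sketch below halves the frame rate of a random placeholder feature tensor; real systems would apply this inside the model rather than as a standalone step.

    import torch
    import torch.nn.functional as F

    feats = torch.randn(1, 768, 100)                 # (batch, dim, frames) placeholder
    half_rate = F.avg_pool1d(feats, kernel_size=2, stride=2)
    print(feats.shape, "->", half_rate.shape)        # 100 frames -> 50 frames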

CTC Alignments Improve Autoregressive Translation

no code implementations 11 Oct 2022 Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment.

Automatic Speech Recognition (ASR) +5
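
As a minimal, self-contained example of the CTC criterion discussed above, the snippet below applies PyTorch's built-in nn.CTCLoss to random stand-in tensors; the loss marginalizes over all monotonic alignments of the target labels to the input frames.

    import torch
    import torch.nn as nn

    T, N, C = 50, 2, 20                              # frames, batch, vocab (incl. blank)
    log_probs = torch.randn(T, N, C).log_softmax(-1).requires_grad_()
    targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # label ids (0 = blank)
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 10, dtype=torch.long)

    ctc = nn.CTCLoss(blank=0)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    print(float(loss))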

Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization

no code implementations 7 Oct 2022 Shota Horiguchi, Yuki Takashima, Shinji Watanabe, Paola Garcia

This paper focuses on speaker diarization and proposes to conduct the above bi-directional knowledge transfer alternately.

Knowledge Distillation Speaker Diarization +2

E-Branchformer: Branchformer with Enhanced merging for speech recognition

1 code implementation 30 Sep 2022 Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR).

Automatic Speech Recognition (ASR) +1

ESPnet-ONNX: Bridging a Gap Between Research and Production

1 code implementation 20 Sep 2022 Masao Someki, Yosuke Higuchi, Tomoki Hayashi, Shinji Watanabe

In the field of deep learning, researchers often focus on inventing novel neural network models and improving benchmarks.

Spoken Language Understanding

Deep Speech Synthesis from Articulatory Representations

1 code implementation 13 Sep 2022 Peter Wu, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

In the articulatory synthesis task, speech is synthesized from input features containing information about the physical behavior of the human vocal tract.

Speech Synthesis

ASR2K: Speech Recognition for Around 2000 Languages without Audio

1 code implementation 6 Sep 2022 Xinjian Li, Florian Metze, David R Mortensen, Alan W Black, Shinji Watanabe

We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only and improve them to 45% CER and 69% WER when using 10000 raw text utterances.

Language Modeling +1

VQ-T: RNN Transducers using Vector-Quantized Prediction Network States

no code implementations 3 Aug 2022 Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury

Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses.

Language Modeling

When Is TTS Augmentation Through a Pivot Language Useful?

1 code implementation 20 Jul 2022 Nathaniel Robinson, Perez Ogayo, Swetha Gangu, David R. Mortensen, Shinji Watanabe

Developing Automatic Speech Recognition (ASR) for low-resource languages is a challenge due to the small amount of transcribed audio data.

Automatic Speech Recognition (ASR) +3

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

1 code implementation 19 Jul 2022 Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

To showcase such integration, we performed experiments on carefully designed synthetic datasets for noisy-reverberant multi-channel ST and SLU tasks, which can be used as benchmark corpora for future research.

Automatic Speech Recognition (ASR) +5

Two-Pass Low Latency End-to-End Spoken Language Understanding

no code implementations 14 Jul 2022 Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan Black, Shinji Watanabe

End-to-end (E2E) models are becoming increasingly popular for spoken language understanding (SLU) systems and are beginning to achieve performance competitive with pipeline-based approaches.

Speech Recognition +2

Online Continual Learning of End-to-End Speech Recognition Models

no code implementations 11 Jul 2022 Muqiao Yang, Ian Lane, Shinji Watanabe

Continual Learning, also known as Lifelong Learning, aims to continually learn from new data as it becomes available.

Automatic Speech Recognition (ASR) +4

Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models

no code implementations 1 Jul 2022 Yuki Takashima, Shota Horiguchi, Shinji Watanabe, Paola García, Yohei Kawaguchi

In this paper, we present an incremental domain adaptation technique to prevent catastrophic forgetting for an end-to-end automatic speech recognition (ASR) model.

Automatic Speech Recognition (ASR) +2

Improving Speech Enhancement through Fine-Grained Speech Characteristics

1 code implementation 1 Jul 2022 Muqiao Yang, Joseph Konan, David Bick, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

We first identify key acoustic parameters that have been found to correlate well with voice quality (e.g., jitter, shimmer, and spectral flux) and then propose objective functions which are aimed at reducing the difference between clean speech and enhanced speech with respect to these features.

Deep Learning Speech Enhancement
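
As a sketch of one such objective, the snippet below penalizes the spectral-flux difference between enhanced and clean signals. The signals are random placeholders, the flux computation is deliberately simplified, and this is not the paper's exact implementation.

    import torch

    def spectral_flux(wav, n_fft=512, hop=128):
        spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                          window=torch.hann_window(n_fft), return_complex=True).abs()
        # L2 change in the magnitude spectrum between consecutive frames
        return (spec[..., 1:] - spec[..., :-1]).pow(2).sum(dim=-2).sqrt()

    clean = torch.randn(1, 16000)                        # stand-in for clean speech
    enhanced = torch.randn(1, 16000, requires_grad=True) # stand-in for model output
    aux_loss = (spectral_flux(enhanced) - spectral_flux(clean)).abs().mean()
    aux_loss.backward()                                  # differentiable auxiliary term
    print(float(aux_loss))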

Residual Language Model for End-to-end Speech Recognition

no code implementations 15 Jun 2022 Emiru Tsunoo, Yosuke Kashiwagi, Chaitanya Narisetty, Shinji Watanabe

In this paper, we propose a simple external LM fusion method for domain adaptation, which considers the internal LM estimation in its training.

Automatic Speech Recognition (ASR) +4

LegoNN: Building Modular Encoder-Decoder Models

no code implementations 7 Jun 2022 Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji Watanabe, Florian Metze, Luke Zettlemoyer, Abdelrahman Mohamed

We describe LegoNN, a procedure for building encoder-decoder architectures so that their parts can be applied to other tasks without the need for any fine-tuning.

Automatic Speech Recognition (ASR) +5

Self-Supervised Speech Representation Learning: A Review

no code implementations 21 May 2022 Abdelrahman Mohamed, Hung-Yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years.

Automatic Speech Recognition (ASR) +4
