Search Results for author: Takanori Ashihara

Found 26 papers, 1 paper with code

Guided Speaker Embedding

no code implementations16 Oct 2024 Shota Horiguchi, Takafumi Moriya, Atsushi Ando, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix

This paper proposes a guided speaker embedding extraction system, which extracts speaker embeddings of the target speaker using speech activities of target and interference speakers as clues.

Speaker Diarization +1
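The guided extraction described above can be illustrated with a minimal sketch: frame-level features are pooled into a speaker embedding, with the pooling weights derived from the speech-activity clues (this is an assumption-laden simplification for illustration, not the paper's actual model; the function name and interface are hypothetical).

```python
import numpy as np

def guided_speaker_embedding(frames, target_activity, interference_activity):
    """Pool frame features (T, D) into a speaker embedding (D,), guided by
    per-frame speech-activity clues in [0, 1] (illustrative sketch only)."""
    # Emphasize frames where the target speaks and interference is silent.
    weights = target_activity * (1.0 - interference_activity)
    weights = weights / (weights.sum() + 1e-8)  # normalize to a weighted mean
    return frames.T @ weights
```

With this weighting, frames dominated by interference contribute little to the embedding, which is the intuition behind using activities as clues.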

Investigation of Speaker Representation for Target-Speaker Speech Processing

no code implementations15 Oct 2024 Takanori Ashihara, Takafumi Moriya, Shota Horiguchi, Junyi Peng, Tsubasa Ochiai, Marc Delcroix, Kohei Matsuura, Hiroshi Sato

To this end, for the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders (i.e., self-supervised or speaker recognition models) that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.

Action Detection Activity Detection +6

Alignment-Free Training for Transducer-based Multi-Talker ASR

no code implementations30 Sep 2024 Takafumi Moriya, Shota Horiguchi, Marc Delcroix, Ryo Masumura, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Masato Mimura

Thus, MT-RNNT-AFT can be trained without relying on accurate alignments, and it can recognize all speakers' speech with just one round of encoder processing.

Automatic Speech Recognition +2

Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation

no code implementations1 Aug 2024 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix

Using these datasets, our study evaluates two types of Transformer-based models: 1) cascade models that combine ASR and strong text summarization models, and 2) end-to-end (E2E) models that directly convert speech into a text summary.

Automatic Speech Recognition (ASR) +4

SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

no code implementations1 Jul 2024 Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix

Real-time target speaker extraction (TSE) is intended to extract the desired speaker's voice from the observed mixture of multiple speakers in a streaming manner.

Target Speaker Extraction

Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

no code implementations1 Jul 2024 Kenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima

These modules enhance the ability to adapt to a wide variety of speakers in a zero-shot manner by selecting appropriate adapters associated with speaker characteristics on the basis of speaker embeddings.

Decoder Speech Synthesis +1
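The adapter-selection idea above can be sketched as a gating mechanism: each adapter carries a learned key, the speaker embedding is matched against the keys, and the adapter outputs are combined with the resulting softmax weights (a minimal sketch under assumed shapes; the names and the linear-adapter form are hypothetical, not the paper's architecture).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mixture_of_adapters(h, spk_emb, adapter_weights, adapter_keys):
    """Gate K lightweight adapters by speaker embedding (illustrative sketch).

    h               : (D,) hidden activation to adapt
    spk_emb         : (E,) speaker embedding
    adapter_weights : list of K (D, D) adapter matrices
    adapter_keys    : (K, E) learned keys, one per adapter
    """
    gates = softmax(adapter_keys @ spk_emb)               # (K,) selection weights
    outputs = np.stack([W @ h for W in adapter_weights])  # (K, D) adapter outputs
    return gates @ outputs                                # gated combination
```

A speaker whose embedding strongly matches one key effectively routes through that adapter, which is how zero-shot adaptation to unseen speakers is obtained without per-speaker fine-tuning.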

Probing Self-supervised Learning Models with Target Speech Extraction

no code implementations17 Feb 2024 Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Takanori Ashihara, Shoko Araki, Jan Cernocky

TSE uniquely requires both speaker identification and speech separation, distinguishing it from other tasks in the Speech processing Universal PERformance Benchmark (SUPERB) evaluation.

Self-Supervised Learning Speaker Identification +2

What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis

no code implementations31 Jan 2024 Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima

Our analysis unveils that 1) the capacity to represent content information is somewhat unrelated to enhanced speaker representation, 2) specific layers of speech SSL models would be partly specialized in capturing linguistic information, and 3) speaker SSL models tend to disregard linguistic information but exhibit more sophisticated speaker representation.

Self-Supervised Learning

Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters

no code implementations10 Jan 2024 Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, Hiroki Kanagawa, Marc Delcroix, Takafumi Moriya, Yusuke Ijima

The zero-shot text-to-speech (TTS) method, based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations, can reproduce speaker characteristics very accurately.

Self-Supervised Learning Speech Enhancement +3

Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss

no code implementations24 May 2023 Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo

In this work, we propose a new SE training criterion that minimizes the distance between clean and enhanced signals in the feature space of the SSL model, thereby alleviating the mismatch.

Self-Supervised Learning Speech Enhancement
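The training criterion described above, distance between clean and enhanced signals measured in an SSL model's feature space, can be sketched as follows (a simplified illustration; `ssl_encoder` stands in for a pre-trained model such as a HuBERT-like encoder, and the mean-squared form is an assumption, not necessarily the paper's exact loss).

```python
import numpy as np

def ssl_feature_loss(enhanced, clean, ssl_encoder):
    """Mean squared distance between enhanced and clean signals in the
    feature space of a (frozen) SSL encoder (illustrative sketch only)."""
    f_enh = ssl_encoder(enhanced)  # features of the enhanced signal
    f_cln = ssl_encoder(clean)     # features of the clean reference
    return np.mean((f_enh - f_cln) ** 2)
```

Because the distance is computed on SSL features rather than raw waveforms, the enhancement model is pushed toward outputs that the downstream SSL-based models perceive as clean.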

Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models

no code implementations9 May 2023 Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka

However, since the two settings have generally been studied individually, little research has focused on how effective a cross-lingual model is in comparison with a monolingual model.

Automatic Speech Recognition (ASR) +2

Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model

no code implementations24 Apr 2023 Kenichi Fujita, Takanori Ashihara, Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima

This paper proposes a zero-shot text-to-speech (TTS) conditioned by a self-supervised speech-representation model acquired through self-supervised learning (SSL).

Rhythm Self-Supervised Learning +3

Leveraging Large Text Corpora for End-to-End Speech Summarization

no code implementations2 Mar 2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura

The first technique is to utilize a text-to-speech (TTS) system to generate synthesized speech, which is used for E2E SSum training with the text summary.

Automatic Speech Recognition (ASR) +3

Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models

no code implementations14 Jul 2022 Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka

We investigate the performance on SUPERB while varying the structure and KD methods so as to keep the number of parameters constant; this allows us to analyze the contribution of the representation introduced by varying the model architecture.

Automatic Speech Recognition (ASR) +4

Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition

no code implementations4 Jul 2021 Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima

However, the conventional method cannot take into account the relationships between these two different modal inputs because the input contexts are encoded separately for each modality.

Automatic Speech Recognition (ASR) +2
