Search Results for author: Hiroshi Sato

Found 15 papers, 1 paper with code

On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis

1 code implementation • 28 Oct 2022 • Atsushi Ando, Ryo Masumura, Akihiko Takashima, Satoshi Suzuki, Naoki Makishima, Keita Suzuki, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato

This paper investigates the effectiveness and implementation of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis (MSA).
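
As a rough illustration of the late-fusion setup such work builds on, the sketch below combines fixed-size embeddings from per-modality encoders; the module names, dimensions, and concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
# Late-fusion MSA sketch (PyTorch). The inputs stand in for embeddings from
# modality-specific pre-trained encoders; all names and sizes are illustrative.
import torch
import torch.nn as nn

class LateFusionMSA(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, video_dim=256, n_classes=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + audio_dim + video_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, text_feat, audio_feat, video_feat):
        # Concatenate per-utterance embeddings from each modality encoder.
        return self.head(torch.cat([text_feat, audio_feat, video_feat], dim=-1))

model = LateFusionMSA()
logits = model(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 3])
```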

Multimodal Sentiment Analysis

Multimodal Attention Fusion for Target Speaker Extraction

no code implementations • 2 Feb 2021 • Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki

Recently, an audio-visual target speaker extraction method has been proposed that extracts target speech using complementary audio and visual clues.
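
A minimal sketch of attention-based clue fusion, assuming PyTorch and per-utterance clue embeddings; the shapes and the single-layer scoring are assumptions, not the paper's design:

```python
# Sketch of attention-weighted fusion of audio and visual speaker clues
# (hypothetical shapes and names; the actual architecture is in the paper).
import torch
import torch.nn as nn

class ClueAttentionFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores each clue embedding

    def forward(self, audio_clue, visual_clue):
        # clues: (batch, dim) embeddings from the audio/visual clue encoders
        clues = torch.stack([audio_clue, visual_clue], dim=1)  # (B, 2, D)
        weights = torch.softmax(self.score(clues), dim=1)      # (B, 2, 1)
        # Weighted sum lets the model lean on the more reliable modality.
        return (weights * clues).sum(dim=1)                    # (B, D)

fusion = ClueAttentionFusion()
fused = fusion(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```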

Target Speaker Extraction

Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition

no code implementations • 2 Jun 2021 • Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo

We analyze ASR performance on observed and enhanced speech under various noise and interference conditions, and show that speech enhancement degrades ASR under some conditions even for overlapping speech.
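
One way to act on that observation is a simple switching rule; the confidence criterion below is a hypothetical stand-in, since the paper analyzes when enhancement helps rather than prescribing this exact rule:

```python
# Toy switching rule between observed and enhanced signals. The selection
# criterion (an external confidence score) is a placeholder assumption.
def select_asr_input(observed, enhanced, enhancement_confidence, threshold=0.5):
    """Feed the enhanced signal to ASR only when enhancement is trusted."""
    if enhancement_confidence >= threshold:
        return enhanced   # enhancement expected to help (e.g., heavy overlap)
    return observed       # fall back to the unprocessed signal
```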

Automatic Speech Recognition (ASR) +3

Listen only to me! How well can target speech extraction handle false alarms?

no code implementations • 11 Apr 2022 • Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolikova, Hiroshi Sato, Tomohiro Nakatani

Target speech extraction (TSE) extracts the speech of a target speaker in a mixture given auxiliary clues characterizing the speaker, such as an enrollment utterance.
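
A minimal sketch of that enrollment-based conditioning, assuming PyTorch; the multiplicative speaker conditioning and all sizes are illustrative assumptions:

```python
# Minimal TSE conditioning sketch: a speaker embedding computed from the
# enrollment utterance modulates a mask estimator. Names are illustrative.
import torch
import torch.nn as nn

class TinyTSE(nn.Module):
    def __init__(self, feat_dim=257, emb_dim=128):
        super().__init__()
        self.spk_proj = nn.Linear(emb_dim, feat_dim)
        self.mask_net = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, mixture_spec, enroll_emb):
        # Bias mixture features toward the target speaker, then estimate a mask.
        gate = torch.sigmoid(self.spk_proj(enroll_emb)).unsqueeze(1)  # (B, 1, F)
        mask = self.mask_net(mixture_spec * gate)
        return mixture_spec * mask  # estimated target-speaker spectrogram

tse = TinyTSE()
out = tse(torch.rand(2, 100, 257), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 100, 257])
```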

Speaker Identification • Speaker Verification +2

Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations

no code implementations • 16 Jun 2022 • Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura

Experimental validation reveals that both worst-enrollment target training and SI-loss training improve robustness against enrollment variations by increasing speaker discriminability.
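
A sketch of what worst-enrollment selection could look like in a training loop; the function interfaces are placeholders, not the paper's implementation:

```python
# "Worst-enrollment" selection sketch: among candidate enrollment utterances
# for the target speaker, train on the one with the highest extraction loss.
import torch

def worst_enrollment_loss(model, mixture, target, enrollments, loss_fn):
    # Evaluate the extraction loss under each candidate enrollment clue.
    losses = torch.stack([
        loss_fn(model(mixture, enroll), target) for enroll in enrollments
    ])
    # Optimizing the worst case encourages robustness to enrollment variation.
    return losses.max()
```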

Speaker Identification • Speech Extraction

Streaming Target-Speaker ASR with Neural Transducer

no code implementations • 9 Sep 2022 • Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki

We confirm experimentally that our TS-ASR achieves recognition performance comparable to conventional cascade systems in the offline setting, while reducing computation costs and enabling streaming TS-ASR.

Automatic Speech Recognition (ASR) +2

Multi-Perspective Document Revision

no code implementations • COLING 2022 • Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura

To model the task, we design a novel Japanese multi-perspective document revision dataset that simultaneously handles seven perspectives to improve the readability and clarity of a document.

Grammatical Error Correction • Relation Classification +1

Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss

no code implementations • 24 May 2023 • Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo

In this work, we propose a new SE training criterion that minimizes the distance between clean and enhanced signals in the feature representation of the SSL model to alleviate the mismatch.
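
The stated criterion can be sketched directly; `ssl_model` stands in for a frozen self-supervised speech encoder, and the L1 distance is one plausible choice of feature-space distance, not necessarily the paper's:

```python
# Sketch of the described criterion: minimize the distance between SSL-model
# features of the clean and enhanced signals. `ssl_model` is a placeholder
# for any frozen self-supervised speech encoder.
import torch
import torch.nn.functional as F

def ssl_feature_loss(ssl_model, enhanced_wav, clean_wav):
    with torch.no_grad():
        target_feat = ssl_model(clean_wav)   # features of the clean reference
    enhanced_feat = ssl_model(enhanced_wav)  # gradients flow into the SE model
    # L1 distance in the SSL feature space (one plausible choice).
    return F.l1_loss(enhanced_feat, target_feat)
```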

Self-Supervised Learning • Speech Enhancement

How does end-to-end speech recognition training impact speech enhancement artifacts?

no code implementations • 20 Nov 2023 • Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri

Jointly training a speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end has been investigated as a way to mitigate the influence of processing distortion generated by single-channel SE on ASR.
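
A common form of such joint training is a weighted sum of the two objectives; the weight and module interfaces below are illustrative assumptions:

```python
# Joint-training pattern for an SE front-end and ASR back-end: a weighted
# sum of enhancement and recognition losses. `alpha` and the interfaces
# are illustrative, not the paper's exact setup.
def joint_se_asr_loss(se_model, asr_model, noisy_wav, clean_wav, transcript,
                      se_loss_fn, asr_loss_fn, alpha=0.5):
    enhanced = se_model(noisy_wav)
    # ASR consumes the enhanced signal, so its loss also shapes the SE model,
    # which can reduce the impact of processing distortion on recognition.
    return alpha * se_loss_fn(enhanced, clean_wav) + \
           (1 - alpha) * asr_loss_fn(asr_model(enhanced), transcript)
```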

Automatic Speech Recognition (ASR) +2

Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters

no code implementations • 10 Jan 2024 • Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, Hiroki Kanagawa, Marc Delcroix, Takafumi Moriya, Yusuke Ijima

The zero-shot text-to-speech (TTS) method, based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations, can reproduce speaker characteristics very accurately.
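
The conditioning step can be sketched as pooling SSL features of the reference speech into a speaker embedding; mean pooling and the module handles are assumptions for illustration:

```python
# Sketch of the zero-shot conditioning step: pool SSL features of a reference
# utterance into a speaker embedding and pass it to the TTS model. All module
# handles are placeholders for the paper's actual components.
def synthesize_zero_shot(tts_model, ssl_model, text, reference_wav):
    feats = ssl_model(reference_wav)  # (T, D) SSL speech representations
    spk_emb = feats.mean(dim=0)       # mean-pool into a speaker embedding
    return tts_model(text, spk_emb)   # condition synthesis on the speaker
```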

Self-Supervised Learning Speech Enhancement +2
