no code implementations • 21 Feb 2022 • Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima
There have been many attempts to build multimodal dialog systems that can answer questions about given audio-visual information; the representative task for such systems is Audio Visual Scene-Aware Dialog (AVSD).
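A minimal sketch of what one AVSD instance looks like may help; the field names below are illustrative, not the official DSTC challenge schema:

```python
# One AVSD example: the system must answer a question about a video clip,
# given the clip, a caption, and the preceding dialog turns.
avsd_example = {
    "video_id": "clip_0001",                       # hypothetical identifier
    "caption": "A person picks up a cup and drinks from it.",
    "history": [                                   # earlier Q-A turns about the same clip
        ("what is the person holding ?", "a cup ."),
    ],
    "question": "does the person drink from it ?",
    "answer": "yes , they take a sip .",           # target the system must generate
}
```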
no code implementations • 24 Nov 2021 • Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura
To this end, the proposed method pre-trains the encoder on a multilingual dataset that combines data from the resource-poor language and the resource-rich language, so that the encoder learns language-invariant knowledge for scene text recognition.
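A minimal sketch of this data setup, assuming PyTorch-style datasets; the dummy tensors and the toy encoder are placeholders, not the paper's architecture:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Dummy stand-ins: (image, label) pairs for each language's scene-text data.
rich = TensorDataset(torch.randn(1000, 3, 32, 128), torch.zeros(1000, dtype=torch.long))
poor = TensorDataset(torch.randn(100, 3, 32, 128), torch.ones(100, dtype=torch.long))

# Combine both languages into a single multilingual stream, pushing the
# encoder toward language-invariant visual features during pre-training.
multilingual = ConcatDataset([rich, poor])
loader = DataLoader(multilingual, batch_size=32, shuffle=True)

encoder = torch.nn.Sequential(                    # placeholder scene-text encoder
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
)
for images, _ in loader:
    features = encoder(images)                    # a pre-training loss would go here
    break
```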
no code implementations • 22 Nov 2021 • Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura
Dialogue sequence labeling is a supervised learning task that estimates a label for each utterance in the target dialogue document, and is useful for many applications such as dialogue act estimation.
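A minimal sketch of the task's input/output shape, using a generic utterance-encoder plus dialogue-level tagger (not the paper's exact model):

```python
import torch
import torch.nn as nn

class DialogueTagger(nn.Module):
    def __init__(self, vocab=1000, dim=64, num_labels=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.utt_enc = nn.GRU(dim, dim, batch_first=True)  # encodes each utterance
        self.dlg_enc = nn.GRU(dim, dim, batch_first=True)  # runs over the utterance sequence
        self.out = nn.Linear(dim, num_labels)              # per-utterance label scores

    def forward(self, dialogue):                  # dialogue: (num_utts, utt_len) token ids
        _, h = self.utt_enc(self.embed(dialogue)) # h: (1, num_utts, dim)
        ctx, _ = self.dlg_enc(h)                  # contextualize utterances across the dialogue
        return self.out(ctx)                      # (1, num_utts, num_labels)

tagger = DialogueTagger()
dialogue = torch.randint(0, 1000, (8, 20))        # 8 utterances, 20 tokens each
print(tagger(dialogue).shape)                     # torch.Size([1, 8, 5]): one label per utterance
```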
no code implementations • 7 Jul 2021 • Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi, Naoki Makishima
We propose a semi-supervised learning method for building end-to-end rich transcription-style automatic speech recognition (RT-ASR) systems from small-scale rich transcription-style and large-scale common transcription-style datasets.
Automatic Speech Recognition (ASR) +1
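A minimal sketch of how the two transcription styles could be mixed, assuming the style is signaled with a special prefix token (a common device for style-aware seq2seq systems; the details here are illustrative):

```python
RICH, COMMON = "<rich>", "<common>"

rich_data = [("audio_001.wav", "well , um , I think so ?")]  # fillers, casing, punctuation
common_data = [("audio_002.wav", "i think so")]              # plain transcription

training_pairs = (
    [(wav, f"{RICH} {text}") for wav, text in rich_data]
    + [(wav, f"{COMMON} {text}") for wav, text in common_data]
)
# At inference, forcing the decoder to start with <rich> requests rich
# transcription-style output even though rich-style data was scarce.
print(training_pairs)
```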
no code implementations • 4 Jul 2021 • Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima
However, the conventional method cannot take the relationships between these two modal inputs into account because the input contexts are encoded separately for each modality.
Automatic Speech Recognition (ASR) +1
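A minimal sketch of joint encoding: concatenating the two modal sequences lets self-attention relate positions across modalities, which separate encoders cannot do. The dimensions and single shared Transformer layer are illustrative:

```python
import torch
import torch.nn as nn

dim = 64
speech = torch.randn(1, 50, dim)   # e.g. 50 acoustic frames, already embedded
text = torch.randn(1, 12, dim)     # e.g. 12 context tokens, already embedded

# Joint encoding over the concatenated sequence of both modalities.
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
joint = layer(torch.cat([speech, text], dim=1))
print(joint.shape)                 # torch.Size([1, 62, 64])
```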
no code implementations • 4 Jul 2021 • Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi
To address this problem, we propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation.
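A minimal sketch of a serialized target sequence for such unified modeling, where one autoregressive decoder emits both attribute tokens and words; the token inventory and ordering here are assumptions, not the paper's specification:

```python
# Single decoder output covering both talkers in an overlapped utterance.
target = [
    "<spk>", "<female>", "<adult>",   # attributes of the first talker
    "hello", "there",                 # that talker's transcription
    "<spk>", "<male>", "<child>",     # attributes of the overlapping talker
    "hi",
    "<eos>",
]
# A standard seq2seq decoder trained on such sequences performs recognition
# and speaker attribute estimation with a single softmax over tokens.
```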
no code implementations • 23 Jun 2021 • Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura
To execute multiple conversion tasks simultaneously without preparing matched datasets, our key idea is to distinguish individual conversion tasks with an on-off switch.
Automatic Speech Recognition (ASR) +3
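A minimal sketch of the on-off switch idea: a binary vector tells the model which conversions to apply. The task names and the conditioning scheme are assumptions for illustration:

```python
import torch
import torch.nn as nn

TASKS = ["punctuation", "capitalization", "disfluency_removal"]  # hypothetical tasks

def switch_vector(active):            # e.g. {"punctuation"} -> [[1., 0., 0.]]
    return torch.tensor([[1.0 if t in active else 0.0 for t in TASKS]])

proj = nn.Linear(len(TASKS), 64)      # maps the switch to a conditioning vector
cond = proj(switch_vector({"punctuation", "disfluency_removal"}))
print(cond.shape)                     # (1, 64): e.g. added to the encoder states
# Turning several switches on requests those conversions simultaneously,
# even if no dataset pairs inputs with that exact combination.
```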
no code implementations • 2 Mar 2021 • Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi, Ryo Masumura
We present an audio-visual speech separation learning method that considers the correspondence between the separated signals and the visual signals so that speech characteristics are reflected during training.
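A minimal sketch of such a training objective: a separation loss plus a term encouraging each separated signal to correspond to its speaker's visual stream. Both loss terms and the 0.1 weight are simplified placeholders:

```python
import torch
import torch.nn.functional as F

def correspondence_loss(sep_emb, vis_emb):
    # Pull matched (separated-audio, face) embedding pairs together.
    return 1.0 - F.cosine_similarity(sep_emb, vis_emb, dim=-1).mean()

sep_audio = torch.randn(2, 16000)     # separated waveforms (2 speakers)
ref_audio = torch.randn(2, 16000)     # ground-truth source waveforms
sep_emb = torch.randn(2, 128)         # embeddings of the separated signals
vis_emb = torch.randn(2, 128)         # embeddings of the lip/face videos

loss = F.mse_loss(sep_audio, ref_audio) + 0.1 * correspondence_loss(sep_emb, vis_emb)
print(loss.item())
```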
no code implementations • 16 Feb 2021 • Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi
This paper presents a novel self-supervised learning method for handling conversational documents consisting of transcribed text of human-to-human conversations.
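A minimal sketch of one self-supervised objective such a method might use on transcribed conversations; this particular objective, recovering the original order of shuffled utterances, is an illustrative assumption rather than the paper's method:

```python
import random

dialogue = [
    "A: hi , how can i help you ?",
    "B: i'd like to change my reservation .",
    "A: sure , what's the booking number ?",
]
order = list(range(len(dialogue)))
random.shuffle(order)
shuffled = [dialogue[i] for i in order]
# Training input: the shuffled utterances; training target: `order`.
# No manual labels are needed: supervision comes from the conversation itself.
print(shuffled, order)
```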
no code implementations • 16 Feb 2021 • Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi
We evaluate the effectiveness of the proposed model and proposed training method on Japanese discourse ASR tasks.
Automatic Speech Recognition (ASR) +3
no code implementations • 15 Feb 2021 • Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura
However, these models require a large amount of paired spoken-style and style-normalized text, and it is difficult to prepare such a volume of data.
no code implementations • INLG (ACL) 2020 • Mana Ihori, Ryo Masumura, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi
Thus, since it is hard to prepare a large amount of paired data, it is important to leverage knowledge memorized in the external language model (LM) when building the seq2seq model.
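A minimal sketch of shallow fusion, one common way to leverage an external LM during seq2seq decoding (whether this matches the paper's integration method is an assumption):

```python
import torch

def fused_step(seq2seq_logprobs, lm_logprobs, lam=0.3):
    """Combine next-token log-probabilities from the seq2seq model and the
    external LM; the chosen token maximizes the weighted sum."""
    return seq2seq_logprobs + lam * lm_logprobs

s2s = torch.log_softmax(torch.randn(1, 1000), dim=-1)  # seq2seq next-token scores
lm = torch.log_softmax(torch.randn(1, 1000), dim=-1)   # external LM scores
next_token = fused_step(s2s, lm).argmax(dim=-1)
print(next_token)
```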