no code implementations • 13 Dec 2024 • Jaehyeon Kim, Taehong Moon, Keon Lee, Jaewoong Cho
We demonstrate that our proposed token masking and multi-token prediction method can be formulated within a principled probabilistic framework using a discrete diffusion process and variational inference.
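Since the excerpt only names the ingredients, here is a minimal, hypothetical sketch of what training with token masking and parallel multi-token prediction can look like under the absorbing-state view of discrete diffusion; the function name, arguments, and masking schedule are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_multi_token_loss(model, tokens, mask_id):
    """Sketch: predict several masked tokens in parallel.

    tokens:  (batch, seq_len) integer token ids.
    model:   any callable mapping a corrupted sequence to (batch, seq_len, vocab) logits.
    mask_id: id of a special [MASK] token acting as the absorbing state.
    """
    batch, seq_len = tokens.shape
    # Sample a per-example masking ratio; in absorbing-state discrete diffusion
    # this plays the role of the noise level t.
    ratio = torch.rand(batch, 1, device=tokens.device).clamp(min=0.05)
    mask = torch.rand(batch, seq_len, device=tokens.device) < ratio
    corrupted = torch.where(mask, torch.full_like(tokens, mask_id), tokens)

    logits = model(corrupted)                            # predict every position at once
    return F.cross_entropy(logits[mask], tokens[mask])   # supervise only masked slots
```

At inference, the same model can be applied iteratively, unmasking a subset of positions per step, which is what makes multi-token (rather than strictly left-to-right) generation possible.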
1 code implementation • 17 Jun 2024 • Keon Lee, Dong Won Kim, Jaehyeon Kim, Seungjun Chung, Jaewoong Cho
Large-scale latent diffusion models (LDMs) excel in content generation across various modalities, but their reliance on phonemes and durations in text-to-speech (TTS) limits their scalability and makes them harder to adopt outside the speech field.
no code implementations • 3 Apr 2024 • Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho
With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis.
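As a rough illustration of that recipe, the sketch below conditions a small causal transformer on text tokens and models a single stream of codec tokens autoregressively; the class name, vocabulary sizes, and the single-codebook simplification are assumptions for illustration, not the configuration of the systems described in the paper.

```python
import torch
import torch.nn as nn

class CodecLM(nn.Module):
    """Toy decoder-only LM over concatenated text and audio-codec tokens."""

    def __init__(self, text_vocab=256, audio_vocab=1024, dim=256, layers=4, heads=4):
        super().__init__()
        self.audio_offset = text_vocab  # audio ids follow text ids in one shared table
        self.embed = nn.Embedding(text_vocab + audio_vocab, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, text_vocab + audio_vocab)

    def forward(self, text_ids, audio_ids):
        # Condition on text by prepending it to the codec-token stream.
        seq = torch.cat([text_ids, audio_ids + self.audio_offset], dim=1)
        x = self.embed(seq)
        # Causal mask so each position only attends to earlier positions.
        n = seq.size(1)
        causal = torch.triu(
            torch.full((n, n), float("-inf"), device=x.device), diagonal=1
        )
        h = self.backbone(x, mask=causal)
        return self.head(h)  # next-token logits over the joint vocabulary
```

Zero-shot behavior in this family of systems typically comes from also prepending codec tokens of a short enrollment prompt from the target speaker, so the LM continues in that voice.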
1 code implementation • 12 Jul 2023 • Jaewoong Cho, Kartik Sreenivasan, Keon Lee, Kyunghoo Mun, Soheun Yi, Jeong-Gwan Lee, Anna Lee, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee
Contrastive learning has gained significant attention as a method for self-supervised learning.
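For readers unfamiliar with the setup, the snippet below is a standard InfoNCE-style contrastive loss: matching rows of the two views are positives and all other in-batch pairs are negatives. It is a generic sketch of contrastive learning, not the specific objective studied in the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss between two batches of embeddings.

    z1, z2: (batch, dim) embeddings of two augmented views of the same examples.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)          # positives sit on the diagonal
```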
1 code implementation • NeurIPS 2023 • Taeho Yoon, Kibeom Myoung, Keon Lee, Jaewoong Cho, Albert No, Ernest K. Ryu
However, a pre-trained diffusion model can be partially misaligned: it is capable of generating good images, yet it sometimes outputs undesirable ones.
no code implementations • 26 Oct 2022 • Kyumin Park, Keon Lee, Daeyoung Kim, Dongyeop Kang
We present a novel speech dataset, RedPen, with human annotations on unnatural speech regions and their corresponding reasons.
1 code implementation • 3 Jul 2022 • Keon Lee, Kyumin Park, Daeyoung Kim
The majority of current Text-to-Speech (TTS) datasets, which are collections of individual utterances, contain few conversational aspects.
1 code implementation • 17 Mar 2021 • Keon Lee, Kyumin Park, Daeyoung Kim
Previous works on neural text-to-speech (TTS) have addressed limited training and inference speed, robustness under difficult synthesis conditions, expressiveness, and controllability.