no code implementations • 27 May 2025 • Jeongsoo Choi, Jaehun Kim, Joon Son Chung
This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed.
1 code implementation • 26 May 2025 • Jeongsoo Choi, Zhikang Niu, Ji-Hoon Kim, Chunhui Wang, Joon Son Chung, Xie Chen
While recent studies have achieved remarkable advancements, their training demands substantial time and computational resources, largely due to the implicit guidance of diffusion models in learning complex intermediate representations.
no code implementations • 29 Apr 2025 • Jeongsoo Choi, Ji-Hoon Kim, Kim Sung-Bin, Tae-Hyun Oh, Joon Son Chung
Despite recent progress, existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker.
no code implementations • 3 Apr 2025 • Kim Sung-Bin, Jeongsoo Choi, Puyuan Peng, Joon Son Chung, Tae-Hyun Oh, David Harwath
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues.
no code implementations • CVPR 2025 • Ji-Hoon Kim, Jeongsoo Choi, Jaehun Kim, Chaeyoung Jung, Joon Son Chung
This is achieved by learning hierarchical representations from video to speech.
no code implementations • 14 Mar 2025 • Sungwoo Cho, Jeongsoo Choi, Sungnyun Kim, Se-Young Yun
Despite recent advances in text-to-speech (TTS) models, audio-visual to audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features.
1 code implementation • 5 Mar 2025 • Youngjoon Jang, Jeongsoo Choi, Junseok Ahn, Joon Son Chung
The objective of this work is to align asynchronous subtitles in sign language videos with limited labelled data.
1 code implementation • 29 Nov 2024 • Jeongsoo Choi, Ji-Hoon Kim, Jinyu Li, Joon Son Chung, Shujie Liu
In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos.
no code implementations • 27 Oct 2024 • Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, Furu Wei
Text-to-video models have recently undergone rapid and substantial advancements.
no code implementations • 17 Oct 2024 • Tan Dat Nguyen, Ji-Hoon Kim, Jeongsoo Choi, Shukjae Choi, Jinseok Park, Younglo Lee, Joon Son Chung
In our experiments, we demonstrate that the time required to predict each token is reduced by a factor of 4 to 5 compared to baseline models, with minimal quality trade-off, and in some cases even an improvement, in speech intelligibility.
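One plausible way to read this per-token speedup is a decoder head that emits several codec tokens per forward pass instead of one; the sketch below is purely illustrative, and the module name `MultiTokenHead` and the setting `n_tokens_per_step` are assumptions, not the paper's API.

```python
# Illustrative sketch (not the paper's code): a decoder head that predicts
# several codec tokens per forward pass, so the wall-clock time per token
# drops roughly by that factor.
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Hypothetical head: maps one hidden state to k next-token distributions."""
    def __init__(self, hidden_dim: int, vocab_size: int, n_tokens_per_step: int = 4):
        super().__init__()
        self.n = n_tokens_per_step
        self.vocab_size = vocab_size
        self.proj = nn.Linear(hidden_dim, vocab_size * n_tokens_per_step)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) -> (batch, n_tokens_per_step, vocab_size)
        return self.proj(h).view(-1, self.n, self.vocab_size)

head = MultiTokenHead(hidden_dim=512, vocab_size=1024, n_tokens_per_step=4)
h = torch.randn(2, 512)                   # decoder states for 2 sequences
next_tokens = head(h).argmax(dim=-1)      # 4 tokens emitted per decoding step
print(next_tokens.shape)                  # torch.Size([2, 4])
```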
1 code implementation • CVPR 2024 • Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro
To mitigate the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system on an audio-only (A2A) dataset.
no code implementations • 15 Sep 2023 • Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro
To this end, we start by importing rich knowledge of image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp.
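A common way to wire up this kind of knowledge transfer is to initialize the image encoder from a pretrained vision-language checkpoint and attach a new speech-unit head; the checkpoint, unit vocabulary size, and head below are assumptions for illustration, not the actual Im2Sp setup.

```python
# Sketch only: initialize an image encoder from a pretrained vision-language
# model and add a fresh head that predicts discrete speech units.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
unit_head = nn.Linear(vision_encoder.config.hidden_size, 200)  # 200 = hypothetical unit vocab

pixels = torch.randn(1, 3, 224, 224)                 # a single input image
feats = vision_encoder(pixel_values=pixels).pooler_output
unit_logits = unit_head(feats)                       # logits over discrete speech units
```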
no code implementations • ICCV 2023 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro
To mitigate this challenge, we learn general speech knowledge, i.e., the ability to model lip movements, from a high-resource language through the prediction of speech units.
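A minimal sketch of what "prediction of speech units" can look like in practice, assuming frame-level pseudo-labels from clustered self-supervised speech features; the GRU encoder and unit count are placeholders, not the paper's architecture.

```python
# Assumed setup: pretrain a lip-movement encoder on a high-resource language
# by predicting discrete speech units (cluster IDs) frame by frame.
import torch
import torch.nn as nn

class LipToUnitModel(nn.Module):
    def __init__(self, feat_dim=512, n_units=200):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 512, batch_first=True)  # placeholder lip encoder
        self.classifier = nn.Linear(512, n_units)

    def forward(self, lip_feats):
        hidden, _ = self.encoder(lip_feats)          # (B, T, 512)
        return self.classifier(hidden)               # (B, T, n_units)

model = LipToUnitModel()
lip_feats = torch.randn(4, 75, 512)                  # 3 s of lip features at 25 fps
unit_targets = torch.randint(0, 200, (4, 75))        # pseudo-labels from k-means clustering
logits = model(lip_feats)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), unit_targets)
loss.backward()
```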
no code implementations • 15 Aug 2023 • Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements.
2 code implementations • ICCV 2023 • Jeongsoo Choi, Joanna Hong, Yong Man Ro
In doing so, rich speaker embedding information can be produced solely from the input visual information, and no extra audio information is needed at inference time.
1 code implementation • 3 Aug 2023 • Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro
By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT).
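Since both sides of the problem are discrete unit sequences, the setting can be pictured as a standard sequence-to-sequence model over unit tokens with language tokens steering the many-to-many direction; the token scheme, vocabulary sizes, and generic Transformer below are assumptions for illustration, not the released UTUT code.

```python
# Illustrative sketch: source and target are both discrete speech-unit
# sequences, and source/target language tokens select the translation direction.
import torch
import torch.nn as nn

N_UNITS, N_LANG_TOKENS, D = 1000, 10, 512
embed = nn.Embedding(N_UNITS + N_LANG_TOKENS, D)
model = nn.Transformer(d_model=D, batch_first=True)   # generic encoder-decoder
out_proj = nn.Linear(D, N_UNITS)

src_units = torch.randint(0, N_UNITS, (2, 120))       # source-language unit sequence
tgt_units = torch.randint(0, N_UNITS, (2, 100))       # target-language unit sequence
src_lang = torch.full((2, 1), N_UNITS, dtype=torch.long)      # e.g., <en> token id
tgt_lang = torch.full((2, 1), N_UNITS + 1, dtype=torch.long)  # e.g., <es> token id

src = embed(torch.cat([src_lang, src_units], dim=1))
tgt = embed(torch.cat([tgt_lang, tgt_units], dim=1))
logits = out_proj(model(src, tgt))                    # predict target units
```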
no code implementations • 28 Jun 2023 • Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro
The visual speaker embedding is derived from a single target face image and enables improved mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio.
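A minimal sketch of this conditioning pattern, assuming the speaker embedding is simply broadcast and added to the text encoder states before mapping into the audio latent space; the encoders and dimensions below are placeholders, not the paper's architecture.

```python
# Assumed conditioning: a speaker embedding from a single face image is added
# to the text encoder states, which are then projected to audio latent frames.
import torch
import torch.nn as nn

face_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 256))  # placeholder
text_encoder = nn.GRU(input_size=256, hidden_size=256, batch_first=True)
to_audio_latent = nn.Linear(256, 80)                 # e.g., 80-dim latent frames

face = torch.randn(1, 3, 112, 112)                   # single target face image
text_emb = torch.randn(1, 50, 256)                   # embedded phoneme/text sequence

spk_emb = face_encoder(face)                         # (1, 256) visual speaker embedding
text_states, _ = text_encoder(text_emb)              # (1, 50, 256)
conditioned = text_states + spk_emb.unsqueeze(1)     # broadcast over time
audio_latent = to_audio_latent(conditioned)          # (1, 50, 80)
```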
2 code implementations • 31 May 2023 • Jeongsoo Choi, Minsu Kim, Yong Man Ro
Therefore, the proposed L2S model is trained to generate multiple targets: a mel-spectrogram and speech units.
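A sketch of what such a multi-target objective can look like, assuming the two losses are simply weighted and summed; the shapes and the weight `lambda_unit` are illustrative assumptions.

```python
# Assumed multi-target objective: supervise the model with both a
# mel-spectrogram reconstruction loss and a speech-unit classification loss.
import torch
import torch.nn.functional as F

B, T, n_mels, n_units = 4, 200, 80, 200
pred_mel = torch.randn(B, n_mels, T, requires_grad=True)      # model output (stand-in)
gt_mel = torch.randn(B, n_mels, T)
unit_logits = torch.randn(B, n_units, T, requires_grad=True)  # model output (stand-in)
gt_units = torch.randint(0, n_units, (B, T))

lambda_unit = 1.0                                             # assumed weighting
loss = F.l1_loss(pred_mel, gt_mel) + lambda_unit * F.cross_entropy(unit_logits, gt_units)
loss.backward()
```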
no code implementations • 31 May 2023 • Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro
The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion.
1 code implementation • CVPR 2023 • Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro
Thus, we first show that previous AVSR models are in fact not robust to corruption of the multimodal input streams, i.e., the audio and visual inputs, compared to uni-modal models.
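The kind of robustness probe implied here can be pictured as corrupting one input stream at a time and comparing error rates against uni-modal baselines; the helpers, the `avsr_model`, and `evaluate_wer` below are hypothetical placeholders, not the paper's evaluation code.

```python
# Illustrative robustness probe: corrupt one modality at a time and compare.
import torch

def corrupt_audio(audio, noise_level=1.0):
    """Add strong Gaussian noise to simulate a corrupted audio stream."""
    return audio + noise_level * torch.randn_like(audio)

def corrupt_video(video, drop_prob=0.5):
    """Randomly drop frames of a (B, T, C, H, W) video to simulate visual corruption."""
    mask = (torch.rand(video.shape[0], video.shape[1], 1, 1, 1) > drop_prob).float()
    return video * mask

# Hypothetical usage (avsr_model and evaluate_wer are placeholders):
# for name, (a, v) in {"clean": (audio, video),
#                      "noisy_audio": (corrupt_audio(audio), video),
#                      "dropped_video": (audio, corrupt_video(video))}.items():
#     print(name, evaluate_wer(avsr_model, a, v))
```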
no code implementations • 2 Nov 2022 • Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro
It stores lip motion features from sequential ground truth images in the value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at inference time.
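The described retrieval can be sketched as a cross-modal key-value memory addressed by attention: audio features score the keys, and the matched values return lip motion features. The slot count, dimensions, and module names below are assumptions, not the paper's implementation.

```python
# Sketch of key-value retrieval: lip-motion features are stored as values,
# addressed through audio-aligned keys, so audio alone can retrieve lip motion
# at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_slots, d = 96, 256
key_memory = nn.Parameter(torch.randn(n_slots, d))    # aligned with audio features
value_memory = nn.Parameter(torch.randn(n_slots, d))  # stores lip motion features

audio_feat = torch.randn(1, 40, d)                     # audio features (B, T, d)
attn = F.softmax(audio_feat @ key_memory.t() / d ** 0.5, dim=-1)  # (B, T, n_slots)
retrieved_lip_motion = attn @ value_memory             # (B, T, d) retrieved values
```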