Search Results for author: Jeongsoo Choi

Found 21 papers, 8 papers with code

Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

no code implementations • 27 May 2025 • Jeongsoo Choi, Jaehun Kim, Joon Son Chung

This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed.

Speech-to-Speech Translation • Translation

Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment

1 code implementation • 26 May 2025 • Jeongsoo Choi, Zhikang Niu, Ji-Hoon Kim, Chunhui Wang, Joon Son Chung, Xie Chen

While recent studies have achieved remarkable advancements, their training demands substantial time and computational costs, largely due to the implicit guidance of diffusion models in learning complex intermediate representations.

text-to-speech • Text to Speech

AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation

no code implementations • 29 Apr 2025 • Jeongsoo Choi, Ji-Hoon Kim, Kim Sung-Bin, Tae-Hyun Oh, Joon Son Chung

Despite recent progress, existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker.

In-Context Learning • Speech Synthesis • +2

VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

no code implementations • 3 Apr 2025 • Kim Sung-Bin, Jeongsoo Choi, Puyuan Peng, Joon Son Chung, Tae-Hyun Oh, David Harwath

We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues.

Speech Synthesis

MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation

no code implementations • 14 Mar 2025 • Sungwoo Cho, Jeongsoo Choi, Sungnyun Kim, Se-Young Yun

Despite recent advances in text-to-speech (TTS) models, audio-visual to audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features.

text-to-speech • Text to Speech • +1

Deep Understanding of Sign Language for Sign to Subtitle Alignment

1 code implementation • 5 Mar 2025 • Youngjoon Jang, Jeongsoo Choi, Junseok Ahn, Joon Son Chung

The objective of this work is to align asynchronous subtitles in sign language videos with limited labelled data.

Translation • Video Alignment

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

1 code implementation • 29 Nov 2024 • Jeongsoo Choi, Ji-Hoon Kim, Jinyu Li, Joon Son Chung, Shujie Liu

In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos.

Decoder

Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding

no code implementations • 17 Oct 2024 • Tan Dat Nguyen, Ji-Hoon Kim, Jeongsoo Choi, Shukjae Choi, Jinseok Park, Younglo Lee, Joon Son Chung

In our experiments, we demonstrate that the time required to predict each token is reduced by a factor of 4 to 5 compared to baseline models, with minimal quality trade-off or even improvement in terms of speech intelligibility.
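
As a rough illustration of the idea behind combining multi-token prediction with speculative decoding over discrete codec tokens, the sketch below shows only the draft-and-verify control flow. The `draft_model`/`target_model` stand-ins, vocabulary size, and draft span length are assumptions made for illustration, not the authors' implementation; a real system would also verify the whole drafted span in a single batched pass of the target model rather than one call per token.

```python
# Toy sketch of draft-and-verify speculative decoding over discrete codec
# tokens. `draft_model` and `target_model` are deterministic stand-ins; a
# real system would verify the drafted span in one target forward pass.
import random

VOCAB = 1024       # size of the codec token vocabulary (assumed)
DRAFT_SPAN = 4     # number of tokens drafted per step (assumed)

def draft_model(prefix):
    """Cheap multi-token proposal for the next DRAFT_SPAN tokens."""
    rng = random.Random(len(prefix))
    return [rng.randrange(VOCAB) for _ in range(DRAFT_SPAN)]

def target_model(prefix):
    """Expensive model's next-token choice given the current prefix."""
    rng = random.Random(sum(prefix) + len(prefix))
    return rng.randrange(VOCAB)

def speculative_decode(prompt, max_len=32):
    tokens = list(prompt)
    while len(tokens) < max_len:
        proposal = draft_model(tokens)
        accepted_all = True
        for tok in proposal:                 # verify drafted tokens in order
            verified = target_model(tokens)
            if verified == tok:
                tokens.append(tok)           # accept the drafted token
            else:
                tokens.append(verified)      # reject: keep the target's token
                accepted_all = False
                break
        if accepted_all:
            tokens.append(target_model(tokens))  # bonus token when all accepted
    return tokens[:max_len]

print(speculative_decode([1, 2, 3]))
```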

Speech Synthesis

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

1 code implementation • CVPR 2024 • Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro

To address the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with an audio-only A2A dataset.

Self-Supervised Learning • Speech-to-Speech Translation • +1

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

no code implementations • 15 Sep 2023 • Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

To this end, we start by importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp.

Image Comprehension • Language Modeling • +2

Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

no code implementations • ICCV 2023 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro

To mitigate this challenge, we learn general speech knowledge, i.e., the ability to model lip movements, from a high-resource language through the prediction of speech units.

Lip Reading

DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

2 code implementations • ICCV 2023 • Jeongsoo Choi, Joanna Hong, Yong Man Ro

In doing so, rich speaker embedding information can be produced solely from the input visual information, and no extra audio information is needed at inference time.

Speech Synthesis

Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation

1 code implementation • 3 Aug 2023 • Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro

By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT).
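
As a loose illustration of how a many-to-many unit-to-unit training pair might be framed, the snippet below prepends language tokens to discrete unit sequences. The token names, unit values, and exact formatting are assumptions made only for illustration, not the UTUT recipe itself.

```python
# Hypothetical framing of one unit-to-unit training pair: both sides are
# discrete speech units, and language tokens tell the encoder-decoder which
# languages are involved. All names and values here are illustrative only.
SRC_LANG, TGT_LANG = "<en>", "<es>"
BOS, EOS = "<bos>", "<eos>"

source_units = [412, 87, 87, 903, 15]   # units extracted from a source utterance
target_units = [55, 701, 701, 12]       # units extracted from its translation

encoder_input  = [SRC_LANG] + source_units
decoder_input  = [TGT_LANG, BOS] + target_units
decoder_target = target_units + [EOS]

print(encoder_input)    # ['<en>', 412, 87, 87, 903, 15]
print(decoder_input)    # ['<es>', '<bos>', 55, 701, 701, 12]
print(decoder_target)   # [55, 701, 701, 12, '<eos>']
```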

Decoder • Quantization • +8

Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models

no code implementations • 28 Jun 2023 • Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro

The visual speaker embedding is derived from a single target face image and enables improved mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio.

Face Generation

Intelligible Lip-to-Speech Synthesis with Speech Units

2 code implementations • 31 May 2023 • Jeongsoo Choi, Minsu Kim, Yong Man Ro

Therefore, the proposed L2S model is trained to generate multiple targets: a mel-spectrogram and speech units.
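
To make the multi-target idea concrete, here is a minimal sketch of a joint loss over a continuous mel-spectrogram target and discrete speech-unit targets. The tensor shapes, unit vocabulary size, and loss weighting are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical multi-target training objective for a lip-to-speech model that
# predicts both a mel-spectrogram and discrete speech units. Shapes, the unit
# vocabulary size, and the loss weighting are assumptions for illustration.
import torch
import torch.nn.functional as F

def multi_target_loss(pred_mel, target_mel, unit_logits, target_units,
                      unit_weight=1.0):
    # Regression loss on the continuous mel-spectrogram target.
    mel_loss = F.l1_loss(pred_mel, target_mel)
    # Classification loss on the discrete speech-unit target.
    unit_loss = F.cross_entropy(
        unit_logits.transpose(1, 2),   # (B, num_units, T) as cross_entropy expects
        target_units,                  # (B, T) integer unit indices
    )
    return mel_loss + unit_weight * unit_loss

# Toy example: batch of 2, 50 frames, 80 mel bins, 200 possible speech units.
pred_mel = torch.randn(2, 50, 80)
target_mel = torch.randn(2, 50, 80)
unit_logits = torch.randn(2, 50, 200)
target_units = torch.randint(0, 200, (2, 50))
print(multi_target_loss(pred_mel, target_mel, unit_logits, target_units))
```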

Lip to Speech Synthesis • Speech Synthesis

Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation

no code implementations • 31 May 2023 • Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro

The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion.

Talking Face Generation

Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring

1 code implementation • CVPR 2023 • Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro

Thus, we first show that previous AVSR models are in fact not robust to corruption of the multimodal input streams, i.e., the audio and visual inputs, compared to uni-modal models.

Audio-Visual Speech Recognition • speech-recognition • +1

SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

no code implementations • 2 Nov 2022 • Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro

It stores lip motion features from sequential ground truth images in the value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at inference time.
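
The retrieval step can be pictured as attention over a key-value memory, with audio features acting as keys and stored lip-motion features as values. The sketch below is only an illustration under assumed shapes and a single flat memory, not the paper's exact module.

```python
# Rough sketch of retrieving stored lip-motion features with an audio query
# via attention over a key-value memory. Dimensions and the way the memory is
# populated are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn.functional as F

def retrieve_lip_features(audio_query, audio_keys, lip_values):
    """
    audio_query: (B, T, D)  audio features at inference time
    audio_keys:  (M, D)     stored audio features (memory keys)
    lip_values:  (M, D)     stored lip-motion features (memory values)
    """
    # Scaled similarity between each audio frame and every memory slot.
    scores = torch.matmul(audio_query, audio_keys.t())            # (B, T, M)
    weights = F.softmax(scores / audio_keys.shape[-1] ** 0.5, dim=-1)
    # Weighted sum of stored lip-motion features, one per audio frame.
    return torch.matmul(weights, lip_values)                      # (B, T, D)

audio_query = torch.randn(2, 25, 64)
audio_keys = torch.randn(88, 64)
lip_values = torch.randn(88, 64)
print(retrieve_lip_features(audio_query, audio_keys, lip_values).shape)
```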

Audio-Visual Synchronization • Representation Learning • +1
