1 code implementation • 14 Mar 2025 • Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro
Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information.
Ranked #1 on Audio-Visual Speech Recognition on LRS3-TED (using extra training data)
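Below is a minimal, hypothetical sketch of the general audio-visual fusion idea described above: projecting per-frame audio and lip-region features into a shared space, fusing them, and decoding per-frame token logits. The module names, dimensions, and fusion-by-addition choice are illustrative assumptions, not this paper's architecture.

```python
# Illustrative sketch of audio-visual feature fusion for speech recognition.
# All names, dimensions, and the additive fusion are assumptions, not the paper's model.
import torch
import torch.nn as nn

class AVFusionEncoder(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, d_model=256, vocab_size=1000):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)   # project log-mel features
        self.video_proj = nn.Linear(video_dim, d_model)   # project lip-region features
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, vocab_size)  # per-frame token logits (e.g. for CTC)

    def forward(self, audio_feats, video_feats):
        # Fuse the two streams frame-by-frame (assumes synchronized, equal-length sequences).
        fused = self.audio_proj(audio_feats) + self.video_proj(video_feats)
        return self.classifier(self.encoder(fused))

# Toy usage: 2 clips, 50 synchronized frames each.
logits = AVFusionEncoder()(torch.randn(2, 50, 80), torch.randn(2, 50, 512))
print(logits.shape)  # torch.Size([2, 50, 1000])
```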
1 code implementation • 24 Dec 2024 • Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants.
no code implementations • 23 Dec 2024 • Yeonju Kim, Se Jin Park, Yong Man Ro
Chatbot research is advancing with the growing importance of chatbots in fields that require human interactions, such as customer support and mental health care.
no code implementations • 23 Dec 2024 • Se Jin Park, Yeonju Kim, Hyeongseop Rha, Bella Godiva, Yong Man Ro
In human communication, both verbal and non-verbal cues play a crucial role in conveying emotions, intentions, and meaning beyond words alone.
no code implementations • 12 Jun 2024 • Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeong Hun Yeo, Yong Man Ro
In this paper, we introduce a novel Face-to-Face spoken dialogue model.
no code implementations • 7 Mar 2024 • Seunghee Han, Se Jin Park, Chae Won Kim, Yong Man Ro
We devise a completeness loss and a consistency loss based on semantic similarity scores.
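One plausible reading of such similarity-based objectives is sketched below with cosine similarity; which embeddings get paired is an assumption made for illustration, not the paper's exact formulation.

```python
# Hedged sketch of cosine-similarity-based "completeness" and "consistency" losses.
# The specific embedding pairings are illustrative assumptions.
import torch
import torch.nn.functional as F

def completeness_loss(output_emb, source_emb):
    # Encourage the output to semantically cover the source content.
    return 1.0 - F.cosine_similarity(output_emb, source_emb, dim=-1).mean()

def consistency_loss(emb_a, emb_b):
    # Encourage two related outputs (e.g. from different passes) to agree semantically.
    return 1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1).mean()

src, out_a, out_b = torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256)
loss = completeness_loss(out_a, src) + consistency_loss(out_a, out_b)
```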
1 code implementation • 18 Jan 2024 • Minsu Kim, Jeong Hun Yeo, Se Jin Park, Hyeongseop Rha, Yong Man Ro
Using visual speech units as the inputs to our system, we propose to pre-train a VSR model to predict the corresponding text on multilingual data constructed by merging several VSR databases.
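As a rough sketch of what "visual speech units" could look like in practice, the snippet below discretizes continuous per-frame visual features by nearest-centroid assignment against a codebook trained offline; the feature dimension, codebook size, and the use of k-means centroids are assumptions, not the paper's recipe.

```python
# Hedged sketch: turn continuous visual speech features into discrete unit IDs
# by nearest-centroid assignment. Sizes and the k-means codebook are assumptions.
import torch

def to_visual_speech_units(features, centroids):
    # features: (T, D) per-frame visual features; centroids: (K, D) cluster centers.
    distances = torch.cdist(features, centroids)   # (T, K) pairwise Euclidean distances
    return distances.argmin(dim=-1)                # (T,) discrete unit IDs

feats = torch.randn(120, 768)      # e.g. 120 frames of self-supervised visual features
codebook = torch.randn(200, 768)   # e.g. 200 centroids trained offline
units = to_visual_speech_units(feats, codebook)   # token-like inputs for the VSR model
```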
1 code implementation • CVPR 2024 • Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro
To mitigate the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with audio-only A2A data.
no code implementations • 23 Aug 2023 • Se Jin Park, Joanna Hong, Minsu Kim, Yong Man Ro
We contribute a new large-scale 3D facial mesh dataset, 3D-HDTF, to enable the synthesis of variations in identity, pose, and facial motion of 3D face meshes.
no code implementations • 28 Jun 2023 • Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro
The visual speaker embedding is derived from a single target face image and enables improved mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio.
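A minimal sketch of this conditioning idea follows, assuming a toy face encoder whose speaker embedding is added to the text token embeddings before decoding into an audio latent space; every module, shape, and name here is hypothetical, not the paper's model.

```python
# Hypothetical sketch: a face-derived speaker embedding conditions the mapping
# from text to an audio latent space. All components are illustrative assumptions.
import torch
import torch.nn as nn

class FaceConditionedTTSLatent(nn.Module):
    def __init__(self, vocab_size=100, d_model=256, latent_dim=128):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.face_enc = nn.Sequential(            # toy face encoder -> speaker embedding
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, d_model))
        self.decoder = nn.GRU(d_model, latent_dim, batch_first=True)

    def forward(self, text_ids, face_image):
        spk = self.face_enc(face_image).unsqueeze(1)   # (B, 1, d_model) speaker embedding
        x = self.text_emb(text_ids) + spk              # inject speaker identity into each token
        latent, _ = self.decoder(x)                    # map into the learned audio latent space
        return latent

model = FaceConditionedTTSLatent()
out = model(torch.randint(0, 100, (2, 20)), torch.randn(2, 3, 112, 112))
print(out.shape)  # torch.Size([2, 20, 128])
```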
no code implementations • 31 May 2023 • Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro
The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion.
no code implementations • 2 Nov 2022 • Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro
It stores lip motion features from sequential ground truth images in the value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at inference time.
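A small sketch of such a key-value memory: learned keys are aligned with audio features, and attention over them retrieves the stored lip-motion values from audio input alone; the slot count and dimensions are illustrative assumptions.

```python
# Hedged sketch of a key-value memory addressed by audio to retrieve lip-motion features.
import torch
import torch.nn.functional as F

class AudioLipMemory(torch.nn.Module):
    def __init__(self, num_slots=96, audio_dim=256, lip_dim=256):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(num_slots, audio_dim))    # aligned with audio features
        self.values = torch.nn.Parameter(torch.randn(num_slots, lip_dim))    # stored lip-motion features

    def forward(self, audio_feats):
        # audio_feats: (B, T, audio_dim); address the memory with the audio query.
        attn = F.softmax(audio_feats @ self.keys.t(), dim=-1)   # (B, T, num_slots)
        return attn @ self.values                               # retrieved lip motion, (B, T, lip_dim)

memory = AudioLipMemory()
retrieved = memory(torch.randn(2, 40, 256))   # at inference, audio alone retrieves lip motion
```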
no code implementations • 5 Jul 2022 • Agus Gunawan, Muhammad Adi Nugroho, Se Jin Park
We explore a different direction: improving real image denoising performance through a learning strategy that enables test-time adaptation of the multi-task network.
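To make the test-time adaptation idea concrete, here is a generic sketch that fine-tunes a denoiser on a single noisy test image for a few steps using a self-supervised stand-in objective; the auxiliary loss, optimizer, and hyperparameters are assumptions, not the paper's multi-task setup.

```python
# Generic test-time adaptation loop (illustrative; the self-reconstruction loss is a
# stand-in for whatever auxiliary task the multi-task network actually uses).
import torch

def test_time_adapt(model, noisy_image, steps=5, lr=1e-5):
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        pred = model(noisy_image)
        loss = torch.nn.functional.mse_loss(pred, noisy_image)   # self-supervised signal
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        return model(noisy_image)   # denoised output after adaptation

denoiser = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
                               torch.nn.Conv2d(16, 3, 3, padding=1))
output = test_time_adapt(denoiser, torch.randn(1, 3, 64, 64))
```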
1 code implementation • ICCV 2021 • Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro
By learning the interrelationship through the associative bridge, the proposed framework can obtain target-modality representations from the memory network given only the source-modality input, providing rich information for its downstream tasks.
Ranked #4 on Lipreading on CAS-VSR-W1k (LRW-1000)
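The snippet below sketches one way such an associative bridge could work: a source-modality query computes addressing weights over a source memory, and the same weights read out a paired target memory, so target-modal representations are recalled from the source input alone. The slot count, dimensions, and shared-addressing mechanism are illustrative assumptions.

```python
# Hedged sketch of bridging two modality memories with shared addressing weights.
import torch
import torch.nn.functional as F

class AssociativeBridge(torch.nn.Module):
    def __init__(self, num_slots=112, src_dim=512, tgt_dim=512):
        super().__init__()
        self.src_memory = torch.nn.Parameter(torch.randn(num_slots, src_dim))
        self.tgt_memory = torch.nn.Parameter(torch.randn(num_slots, tgt_dim))

    def forward(self, src_feats):
        # src_feats: (B, T, src_dim), e.g. visual features in lip reading.
        addressing = F.softmax(src_feats @ self.src_memory.t(), dim=-1)   # shared slot weights
        return addressing @ self.tgt_memory   # recalled target-modal (e.g. audio) representations

bridge = AssociativeBridge()
recalled_audio = bridge(torch.randn(2, 30, 512))   # visual-only input at inference
```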
1 code implementation • IEEE/ACM Transactions on Audio, Speech, and Language Processing 2021 • Joanna Hong, Minsu Kim, Se Jin Park, Yong Man Ro
Our key contributions are: (1) proposing the Visual Voice memory, which provides rich auditory information that complements the visual features, thus producing high-quality speech from silent video, and (2) enabling multi-speaker and unseen-speaker training by memorizing auditory features and the corresponding visual features.