1 code implementation • 23 Feb 2024 • Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro
In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements.
Ranked #4 on Lipreading on LRS3-TED (using extra training data)
no code implementations • 18 Jan 2024 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Se Jin Park, Yong Man Ro
Using visual speech units as the inputs to our system, we pre-train the model to predict the corresponding text outputs on massive multilingual data constructed by merging several VSR databases.
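The pipeline this snippet describes, discretizing continuous lip features into "visual speech units" and training a model to map unit sequences to text, can be sketched roughly as below. The k-means quantizer, feature shapes, unit inventory size, and toy transformer are illustrative assumptions, not the paper's exact components:

```python
# Hedged sketch: quantize lip-movement features into discrete "visual
# speech units" via k-means, then feed the unit ids to a unit-to-text
# model. All sizes here are stand-ins for illustration only.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Suppose `feats` are frame-level visual features from a pre-trained
# video encoder: (num_frames, feat_dim). Random stand-ins here.
feats = torch.randn(500, 256)

# 1) Learn a unit inventory by clustering features (done offline, once).
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats.numpy())

# 2) Map each frame to its nearest cluster id -> a discrete unit sequence.
units = torch.as_tensor(kmeans.predict(feats.numpy()), dtype=torch.long)

# 3) Toy unit-to-text model: embed unit ids, encode, project to characters.
model = nn.Sequential(
    nn.Embedding(num_embeddings=100, embedding_dim=256),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
        num_layers=2,
    ),
    nn.Linear(256, 40),  # 40 = assumed character vocabulary size
)
logits = model(units.unsqueeze(0))  # (1, num_frames, 40)
print(logits.shape)
```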
no code implementations • 15 Sep 2023 • Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro
To this end, we begin by importing rich image-comprehension and language-modeling knowledge from a large-scale pre-trained vision-language model into Im2Sp.
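As a rough illustration of "importing" pre-trained visual knowledge into an image-to-speech model, the sketch below initializes an encoder from a pre-trained torchvision backbone and attaches a toy spectrogram head. The backbone, head, and shapes are stand-in assumptions; Im2Sp's actual design draws on a large-scale vision-language model rather than this ResNet:

```python
# Hedged sketch of initializing an image-to-speech model from pre-trained
# vision weights. Architecture and sizes are illustrative only.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class Im2SpSketch(nn.Module):
    def __init__(self, num_mel_bins: int = 80):
        super().__init__()
        # Encoder starts from large-scale pre-trained weights, so it
        # begins with general image-comprehension knowledge.
        backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Toy head mapping the image embedding to a short mel spectrogram
        # (num_mel_bins x 50 frames) -- a stand-in for a real speech decoder.
        self.decoder = nn.Linear(512, num_mel_bins * 50)
        self.num_mel_bins = num_mel_bins

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        z = self.encoder(images).flatten(1)             # (B, 512)
        mel = self.decoder(z)                           # (B, 80 * 50)
        return mel.view(-1, self.num_mel_bins, 50)      # (B, 80, 50)

model = Im2SpSketch()
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 80, 50])
```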
no code implementations • 15 Sep 2023 • Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro
Unlike previous methods that tried to improve VSR performance for a target language using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for different languages without human intervention.
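One way to grow training data without human intervention, consistent with this snippet, is to pseudo-label the audio track of unlabeled multilingual videos with an off-the-shelf ASR model and pair each transcript with the video's lip frames. The sketch below uses Whisper purely as an illustrative labeler; the paper's actual pipeline may differ:

```python
# Hedged sketch: auto-transcribe unlabeled videos' audio to create
# pseudo-labeled VSR training pairs. Whisper is an illustrative choice.
import whisper  # pip install openai-whisper

asr = whisper.load_model("base")  # multilingual checkpoint

def auto_label(video_paths):
    """Yield (video_path, pseudo_transcript) pairs for VSR training."""
    for path in video_paths:
        # Whisper decodes the audio stream from the media file via ffmpeg
        # and auto-detects the spoken language.
        result = asr.transcribe(path)
        yield path, result["text"]

# Hypothetical file names, for illustration only.
for path, text in auto_label(["clip_es.mp4", "clip_fr.mp4"]):
    print(path, "->", text)
```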
no code implementations • ICCV 2023 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro
To mitigate this challenge, we learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units.
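The pre-training objective hinted at here, predicting discrete speech units from lip frames of a high-resource language, might look roughly like the following. The conv front-end, the unit inventory of 500, and all shapes are assumptions for illustration:

```python
# Hedged sketch of unit-prediction pre-training: a visual front-end
# classifies each lip frame into a discrete speech unit (e.g., a cluster
# id derived from self-supervised audio features). Sizes are illustrative.
import torch
import torch.nn as nn

NUM_UNITS = 500  # assumed size of the speech-unit inventory

class LipToUnits(nn.Module):
    def __init__(self):
        super().__init__()
        # Toy visual front-end over per-frame lip crops (B, T, 1, 88, 88).
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32, NUM_UNITS)

    def forward(self, lips):
        b, t = lips.shape[:2]
        feats = self.frontend(lips.flatten(0, 1))     # (B*T, 32)
        return self.classifier(feats).view(b, t, -1)  # (B, T, NUM_UNITS)

model = LipToUnits()
lips = torch.randn(2, 16, 1, 88, 88)            # fake lip-crop clips
targets = torch.randint(0, NUM_UNITS, (2, 16))  # unit ids from the audio side
logits = model(lips)
loss = nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten())
loss.backward()  # "general speech knowledge" is learned through this loss
print(float(loss))
```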
no code implementations • 15 Aug 2023 • Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements.
no code implementations • 8 May 2023 • Jeong Hun Yeo, Minsu Kim, Yong Man Ro
Visual Speech Recognition (VSR) is the task of predicting a sentence or word from lip movements.
Automatic Speech Recognition (ASR) +3
1 code implementation • The AAAI Conference on Artificial Intelligence (AAAI) 2022 • Minsu Kim, Jeong Hun Yeo, Yong Man Ro
With its multi-head key memories, MVM extracts possible candidate audio features from the memory, which lets the lip reading model consider which pronunciations the input lip movement could represent (a rough sketch of this addressing scheme follows this entry).
Ranked #2 on Lipreading on CAS-VSR-W1k (LRW-1000)
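A minimal sketch of a multi-head key memory consistent with the MVM description above: a visual query addresses several learnable key memories, and each head reads an attention-weighted candidate audio feature from a value memory. Slot counts, dimensions, and the single shared value memory are assumptions, not the published configuration:

```python
# Hedged sketch of a multi-head key memory. Each head matches the visual
# query against its own key memory, so different heads can surface
# different plausible pronunciations for the same lip movement.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadKeyMemory(nn.Module):
    def __init__(self, dim=256, slots=64, heads=4):
        super().__init__()
        # One learnable key memory per head; a single value memory that
        # stands in for saved audio representations.
        self.keys = nn.Parameter(torch.randn(heads, slots, dim))
        self.values = nn.Parameter(torch.randn(slots, dim))

    def forward(self, visual_query):  # (B, dim)
        # Address every head's key memory with the same visual query.
        scores = torch.einsum('bd,hsd->bhs', visual_query, self.keys)
        attn = F.softmax(scores / visual_query.shape[-1] ** 0.5, dim=-1)
        # Each head reads out one candidate audio feature -> (B, heads, dim).
        return torch.einsum('bhs,sd->bhd', attn, self.values)

mem = MultiHeadKeyMemory()
candidates = mem(torch.randn(2, 256))
print(candidates.shape)  # torch.Size([2, 4, 256])
```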