Search Results for author: Yong Man Ro

Found 72 papers, 26 papers with code

SACA Net: Cybersickness Assessment of Individual Viewers for VR Content via Graph-based Symptom Relation Embedding

no code implementations • ECCV 2020 • Sangmin Lee, Jung Uk Kim, Hak Gu Kim, Seongyeop Kim, Yong Man Ro

In this paper, we propose a novel symptom-aware cybersickness assessment network (SACA Net) that quantifies physical symptom levels for assessing cybersickness of individual viewers.

Relation

MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection

no code implementations • 22 Mar 2024 • Taeheon Kim, Sangyun Chung, Damin Yeom, Youngjoon Yu, Hak Gu Kim, Yong Man Ro

Specifically, we generate text descriptions of the pedestrian in each RGB and thermal modality and design a Multispectral Chain-of-Thought (MSCoT) prompting, which models a step-by-step process to facilitate cross-modal reasoning at the semantic level and perform accurate detection.

Pedestrian Detection
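
The excerpt above describes a two-stage idea: caption each modality, then reason across the captions. Below is a minimal, hypothetical sketch of how such a prompt could be assembled; the function name, wording, and caption inputs are illustrative and not the authors' released code.

```python
# Hypothetical Multispectral Chain-of-Thought (MSCoT) prompt assembly.
def build_mscot_prompt(rgb_caption: str, thermal_caption: str) -> str:
    """Compose per-modality descriptions into a step-by-step reasoning prompt."""
    return (
        f"Step 1 (RGB): the visible image shows {rgb_caption}.\n"
        f"Step 2 (Thermal): the thermal image shows {thermal_caption}.\n"
        "Step 3 (Cross-modal reasoning): considering both modalities, "
        "decide whether a pedestrian is present and output a bounding box."
    )

prompt = build_mscot_prompt(
    rgb_caption="a dark silhouette near a parked car in low light",
    thermal_caption="a warm, human-shaped region on the left sidewalk",
)
print(prompt)
```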

What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models

1 code implementation • 20 Mar 2024 • Junho Kim, Yeon Ju Kim, Yong Man Ro

This paper presents a way of enhancing the reliability of Large Multimodal Models (LMMs) in addressing hallucination effects, where models generate incorrect or unrelated responses.

Counterfactual • Hallucination
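
The mitigation described above operates at the prompt level. A minimal sketch, assuming the counterfactual keywords have already been generated (in the paper they come from the model itself); the function name and wording are hypothetical.

```python
# Hypothetical counterfactual-inception prompt: inject counterfactual
# keywords so the model reconsiders its initial answer before responding.
def counterfactual_prompt(question: str, keywords: list[str]) -> str:
    cf = "; ".join(keywords)
    return (
        f"{question}\n"
        f"Before answering, consider the counterfactuals: {cf}. "
        "If they would change your answer, re-examine the image first."
    )

print(counterfactual_prompt(
    "How many people are in the image?",
    ["the figures are reflections", "some figures are mannequins"],
))
```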

MoAI: Mixture of All Intelligence for Large Language and Vision Models

1 code implementation • 12 Mar 2024 • Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models.

Scene Understanding • Visual Question Answering
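
One way to read "leverages auxiliary visual information" is that the outputs of the external models are verbalized into extra context the language model can attend to. The sketch below illustrates that reading only; all names and formats are assumptions, not MoAI's actual interface.

```python
# Hypothetical verbalization of external CV-model outputs into auxiliary text.
def verbalize_auxiliary(segments, detections, relations, ocr_tokens) -> str:
    lines = []
    if segments:
        lines.append("Segments: " + ", ".join(segments))
    if detections:
        lines.append("Detections: " + ", ".join(f"{c} at {b}" for c, b in detections))
    if relations:
        lines.append("Relations: " + "; ".join(" ".join(r) for r in relations))
    if ocr_tokens:
        lines.append("Text in image: " + " | ".join(ocr_tokens))
    return "\n".join(lines)

print(verbalize_auxiliary(
    segments=["road", "person", "car"],
    detections=[("person", (12, 40, 80, 200)), ("car", (100, 60, 300, 180))],
    relations=[("person", "next to", "car")],
    ocr_tokens=["STOP"],
))
```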

Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection

1 code implementation • 2 Mar 2024 • Taeheon Kim, Sebin Shin, Youngjoon Yu, Hak Gu Kim, Yong Man Ro

As a result, multispectral pedestrian detectors show poor generalization ability on examples beyond this statistical correlation, such as ROTX data.

Pedestrian Detection

Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

1 code implementation • 23 Feb 2024 • Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro

In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements.

Ranked #4 on Lipreading on LRS3-TED (using extra training data)

Lipreading • Lip Reading • +3

CoLLaVO: Crayon Large Language and Vision mOdel

1 code implementation • 17 Feb 2024 • Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks.

Large Language Model • Object • +3

Multilingual Visual Speech Recognition with a Single Model by Learning with Discrete Visual Speech Units

no code implementations • 18 Jan 2024 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Se Jin Park, Yong Man Ro

By using the visual speech units as the inputs of our system, we pre-train the model to predict corresponding text outputs on massive multilingual data constructed by merging several VSR databases.

Sentence • Speech Recognition • +1
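
"Visual speech units" are discrete tokens obtained from continuous visual features. Below is a minimal sketch of the usual recipe (nearest-centroid assignment against a k-means codebook), with random placeholders standing in for trained features and centroids:

```python
import torch

def to_speech_units(features: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """features: (T, D) per-frame features; centroids: (K, D). Returns (T,) unit ids."""
    distances = torch.cdist(features, centroids)  # (T, K) Euclidean distances
    return distances.argmin(dim=1)                # index of the nearest centroid

frame_features = torch.randn(75, 256)   # e.g., 3 s of video at 25 fps (placeholder)
codebook = torch.randn(200, 256)        # K = 200 units (placeholder, untrained)
units = to_speech_units(frame_features, codebook)
print(units.shape, units[:10])
```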

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

1 code implementation • 5 Dec 2023 • Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro

To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A.

Self-Supervised Learning • Speech-to-Speech Translation • +1

Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection

1 code implementation • 2 Nov 2023 • Sungjune Park, Hyunjun Kim, Yong Man Ro

The obtained knowledge elements are adaptable to various detection frameworks, so that we can provide plentiful appearance information by integrating the language-derived appearance elements with visual cues within a detector.

Pedestrian Detection

Causal Unsupervised Semantic Segmentation

1 code implementation • 11 Oct 2023 • Junho Kim, Byung-Kwan Lee, Yong Man Ro

Unsupervised semantic segmentation aims to achieve high-quality semantic grouping without human-labeled annotations.

Causal Inference • Segmentation • +2

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

no code implementations • 15 Sep 2023 • Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

To this end, we start by importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp.

Image Comprehension • Language Modelling • +1

Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

no code implementations • 15 Sep 2023 • Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention.

Language Identification • Speech Recognition • +1

DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion

no code implementations • 23 Aug 2023 • Se Jin Park, Joanna Hong, Minsu Kim, Yong Man Ro

We contribute a new large-scale 3D facial mesh dataset, 3D-HDTF, to enable the synthesis of variations in the identities, poses, and facial motions of 3D face meshes.

3D Face Animation

Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

no code implementations • ICCV 2023 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro

To mitigate the challenge, we try to learn general speech knowledge, i.e., the ability to model lip movements, from a high-resource language through the prediction of speech units.

Lip Reading

DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

1 code implementation • ICCV 2023 • Jeongsoo Choi, Joanna Hong, Yong Man Ro

In doing so, the rich speaker embedding information can be produced solely from the input visual information, and extra audio information is not necessary at inference time.

Speech Synthesis

Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation

1 code implementation • 3 Aug 2023 • Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro

A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST).

Representation Learning • Speech-to-Speech Translation • +4

Mitigating Adversarial Vulnerability through Causal Parameter Estimation by Adversarial Double Machine Learning

1 code implementation • ICCV 2023 • Byung-Kwan Lee, Junho Kim, Yong Man Ro

Adversarial examples derived from deliberately crafted perturbations on visual inputs can easily harm the decision process of deep neural networks.

Adversarial Robustness

Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models

no code implementations • 28 Jun 2023 • Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro

The visual speaker embedding is derived from a single target face image and enables improved mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio.

Face Generation

Advancing Adversarial Training by Injecting Booster Signal

no code implementations • 27 Jun 2023 • Hong Joo Lee, Youngjoon Yu, Yong Man Ro

Different from the previous approaches, in this paper, we propose a new approach to improve the adversarial robustness by using an external signal rather than model parameters.

Adversarial Robustness
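
A rough sketch of the "external signal" idea as described above: a learnable border is padded around each input and optimized instead of (or alongside) the model weights. The pad width, image size, and masking scheme are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

class BoosterSignal(torch.nn.Module):
    """Learnable frame added around the image; the image region stays untouched."""
    def __init__(self, pad: int = 8, size: int = 32):
        super().__init__()
        full = size + 2 * pad
        self.pad = pad
        self.frame = torch.nn.Parameter(torch.zeros(3, full, full))
        mask = torch.ones(1, full, full)
        mask[:, pad:pad + size, pad:pad + size] = 0.0  # zero out the interior
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        padded = F.pad(x, [self.pad] * 4)       # place the image in the center
        return padded + self.frame * self.mask  # add signal only on the border

booster = BoosterSignal()
boosted = booster(torch.rand(4, 3, 32, 32))  # (4, 3, 48, 48), fed to the classifier
print(boosted.shape)
```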

Robust Proxy: Improving Adversarial Robustness by Robust Proxy Learning

no code implementations • 27 Jun 2023 • Hong Joo Lee, Yong Man Ro

With the class-wise robust features, the model explicitly learns adversarially robust features through the proposed robust proxy learning framework.

Adversarial Robustness

Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation

no code implementations • 31 May 2023 • Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro

The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion.

Talking Face Generation

Intelligible Lip-to-Speech Synthesis with Speech Units

1 code implementation • 31 May 2023 • Jeongsoo Choi, Minsu Kim, Yong Man Ro

Therefore, the proposed L2S model is trained to generate multiple targets: a mel-spectrogram and speech units.

Lip to Speech Synthesis • Speech Synthesis
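
"Trained to generate multiple targets" suggests a simple multi-task objective. A minimal sketch, assuming an L1 loss on the mel-spectrogram plus a cross-entropy loss on discrete speech units; the shapes and weighting are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def l2s_multi_target_loss(pred_mel, gt_mel, unit_logits, gt_units, unit_weight=1.0):
    """pred_mel, gt_mel: (B, T, 80); unit_logits: (B, T, K); gt_units: (B, T)."""
    mel_loss = F.l1_loss(pred_mel, gt_mel)                              # acoustic target
    unit_loss = F.cross_entropy(unit_logits.transpose(1, 2), gt_units)  # unit target
    return mel_loss + unit_weight * unit_loss

B, T, K = 2, 100, 200
loss = l2s_multi_target_loss(
    torch.randn(B, T, 80), torch.randn(B, T, 80),
    torch.randn(B, T, K), torch.randint(0, K, (B, T)),
)
print(loss.item())
```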

Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring

1 code implementation • CVPR 2023 • Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro

Thus, we first show that previous AVSR models are in fact not robust to corruption of the multimodal input streams, i.e., the audio and visual inputs, compared with uni-modal models.

Audio-Visual Speech Recognition • Speech Recognition • +1

Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression

1 code implementation • CVPR 2023 • Junho Kim, Byung-Kwan Lee, Yong Man Ro

The origin of adversarial examples remains inexplicable, and it has sparked arguments from various viewpoints despite comprehensive investigations.

Adversarial Robustness

Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video

no code implementations • 27 Feb 2023 • Minsu Kim, Chae Won Kim, Yong Man Ro

The proposed DVFA can align the input transcription (i.e., sentence) with the talking face video without accessing the speech audio.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) • +3

Lip-to-Speech Synthesis in the Wild with Multi-task Learning

3 code implementations • 17 Feb 2023 • Minsu Kim, Joanna Hong, Yong Man Ro

To this end, we design multi-task learning that guides the model using multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss.

Lip to Speech Synthesis • Multi-Task Learning • +1

Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition

no code implementations • 16 Feb 2023 • Minsu Kim, Hyung-Il Kim, Yong Man Ro

As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements, and this causes VSR models to show degraded performance when applied to unseen speakers.

Sentence • Speech Recognition • +1

SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

no code implementations • 2 Nov 2022 • Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro

It stores lip motion features from sequential ground truth images in the value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at inference time.

Audio-Visual Synchronization • Representation Learning • +1
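
The recall mechanism described above is essentially a key-value memory read. A minimal sketch under that reading: audio features act as queries over audio keys, and the attention weights retrieve the aligned lip-motion values; the sizes are placeholders.

```python
import torch
import torch.nn.functional as F

def memory_recall(audio_query, audio_keys, lip_values):
    """audio_query: (B, D); audio_keys: (N, D); lip_values: (N, Dv)."""
    scores = audio_query @ audio_keys.t() / audio_keys.shape[1] ** 0.5
    attention = F.softmax(scores, dim=-1)   # (B, N) addressing weights
    return attention @ lip_values           # (B, Dv) recalled lip features

recalled = memory_recall(torch.randn(4, 128), torch.randn(64, 128), torch.randn(64, 256))
print(recalled.shape)  # torch.Size([4, 256])
```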

Meta Input: How to Leverage Off-the-Shelf Deep Neural Networks

no code implementations • 21 Oct 2022 • Minsu Kim, Youngjoon Yu, Sungjune Park, Yong Man Ro

The proposed meta input can be optimized with only a small amount of test data by considering the relation between the test input data and its output prediction.
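
A hedged sketch of that idea: an additive input term is optimized on a handful of test samples while the off-the-shelf network stays frozen. The entropy-minimization objective used here is a stand-in self-supervision signal, not necessarily the paper's derivation.

```python
import torch

def learn_meta_input(model, test_batch, steps=50, lr=0.1):
    """Optimize a shared additive input; the model weights are left untouched."""
    for p in model.parameters():
        p.requires_grad_(False)
    meta = torch.zeros_like(test_batch[:1], requires_grad=True)  # broadcast over batch
    optimizer = torch.optim.Adam([meta], lr=lr)
    for _ in range(steps):
        probs = model(test_batch + meta).softmax(dim=-1)
        loss = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()  # entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return meta.detach()

net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
meta_input = learn_meta_input(net, torch.randn(16, 3, 32, 32))
print(meta_input.shape)  # torch.Size([1, 3, 32, 32])
```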

Speaker-adaptive Lip Reading with User-dependent Padding

1 code implementation • 9 Aug 2022 • Minsu Kim, Hyunjun Kim, Yong Man Ro

In this paper, to remedy the performance degradation of lip reading model on unseen speakers, we propose a speaker-adaptive lip reading method, namely user-dependent padding.

Lip Reading • Speech Recognition • +1
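
A minimal sketch of "user-dependent padding" as described above: the zero padding of a pretrained convolution is replaced with learnable, speaker-specific values, so only the padding adapts per user. The wrapper layout is an assumption, not the authors' code.

```python
import torch
import torch.nn.functional as F

class UserDependentPadding(torch.nn.Module):
    """Wrap a padding-free conv; borders are filled with a learned per-user value."""
    def __init__(self, conv: torch.nn.Conv2d, pad: int = 1):
        super().__init__()
        self.conv, self.pad = conv, pad
        self.pad_value = torch.nn.Parameter(torch.zeros(1, conv.in_channels, 1, 1))

    def forward(self, x):
        padded = F.pad(x, [self.pad] * 4)                   # zeros on the border
        border = torch.ones_like(padded)
        border[..., self.pad:-self.pad, self.pad:-self.pad] = 0.0
        return self.conv(padded + border * self.pad_value)  # user-specific border

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=0)  # pretrained weights stay fixed
layer = UserDependentPadding(conv, pad=1)
print(layer(torch.randn(2, 64, 24, 24)).shape)  # torch.Size([2, 64, 24, 24])
```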

Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition

1 code implementation • 13 Jul 2022 • Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro

The enhanced audio features are fused with the visual features and taken to an encoder-decoder model composed of Conformer and Transformer for speech recognition.

Audio-Visual Speech Recognition • Noisy Speech Recognition • +2

VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection

no code implementations • 15 Jun 2022 • Joanna Hong, Minsu Kim, Yong Man Ro

Thus, the proposed framework brings the advantage of synthesizing speech with the correct content even from the silent talking face video of an unseen subject.

Feature Selection • Speech Synthesis

Defending Person Detection Against Adversarial Patch Attack by using Universal Defensive Frame

no code implementations • 27 Apr 2022 • Youngjoon Yu, Hong Joo Lee, Hakmin Lee, Yong Man Ro

Person detection has attracted great attention in the computer vision area and is an imperative element in human-centric computer vision.

Autonomous Driving • Human Detection • +2

Distilling Robust and Non-Robust Features in Adversarial Examples by Information Bottleneck

1 code implementation • NeurIPS 2021 • Junho Kim, Byung-Kwan Lee, Yong Man Ro

Adversarial examples, generated by carefully crafted perturbation, have attracted considerable attention in research fields.

Adversarial Robustness

Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video

1 code implementation • ICCV 2021 • Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro

By learning the interrelationship through the associative bridge, the proposed bridging framework is able to obtain the target modal representations inside the memory network, even with the source modal input only, and it provides rich information for its downstream tasks.

Lip Reading

Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

1 code implementation • The AAAI Conference on Artificial Intelligence (AAAI) 2022 • Minsu Kim, Jeong Hun Yeo, Yong Man Ro

With the multi-head key memories, MVM extracts possible candidate audio features from the memory, which allows the lip reading model to consider the possibility of which pronunciations can be represented from the input lip movement.

Lip Reading

Lip to Speech Synthesis with Visual Context Attentional GAN

1 code implementation • NeurIPS 2021 • Minsu Kim, Joanna Hong, Yong Man Ro

In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis.

Contrastive Learning • Generative Adversarial Network • +2

Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory

no code implementations • CVPR 2022 • Sangmin Lee, Hyung-Il Kim, Yong Man Ro

Existing sound and image representation learning methods necessarily require a large number of corresponding sound-image pairs.

Representation Learning

Speech Reconstruction with Reminiscent Sound via Visual Voice Memory

1 code implementation • IEEE/ACM Transactions on Audio, Speech, and Language Processing 2021 • Joanna Hong, Minsu Kim, Se Jin Park, Yong Man Ro

Our key contributions are: (1) proposing the Visual Voice memory that brings rich information of audio that complements the visual features, thus producing high-quality speech from silent video, and (2) enabling multi-speaker and unseen speaker training by memorizing auditory features and the corresponding visual features.

Speaker-Specific Lip to Speech Synthesis

Towards a Better Understanding of VR Sickness: Physical Symptom Prediction for VR Contents

no code implementations • 14 Apr 2021 • Hak Gu Kim, Sangmin Lee, Seongyeop Kim, Heoun-taek Lim, Yong Man Ro

To better understand VR sickness, it is necessary to predict and provide the levels of its major symptoms rather than only an overall degree of VR sickness.

Robust Small-Scale Pedestrian Detection With Cued Recall via Memory Learning

no code implementations • ICCV 2021 • Jung Uk Kim, Sungjune Park, Yong Man Ro

The purpose of the proposed large-scale embedding learning is to memorize and recall the large-scale pedestrian appearance via the LPR Memory.

Pedestrian Detection

Towards Adversarial Robustness of Bayesian Neural Network through Hierarchical Variational Inference

1 code implementation • 1 Jan 2021 • Byung-Kwan Lee, Youngjoon Yu, Yong Man Ro

Recent works have applied Bayesian Neural Network (BNN) to adversarial training, and shown the improvement of adversarial robustness via the BNN's strength of stochastic gradient defense.

Adversarial Defense • Adversarial Robustness • +3

Comprehensive Facial Expression Synthesis using Human-Interpretable Language

no code implementations • 16 Jul 2020 • Joanna Hong, Jung Uk Kim, Sangmin Lee, Yong Man Ro

Recent advances in facial expression synthesis have shown promising results using diverse expression representations including facial action units.

Investigating Vulnerability to Adversarial Examples on Multimodal Data Fusion in Deep Learning

no code implementations • 22 May 2020 • Youngjoon Yu, Hong Joo Lee, Byeong Cheon Kim, Jung Uk Kim, Yong Man Ro

The success of multimodal data fusion in deep learning appears to be attributed to the use of complementary information between multiple input data.

Adversarial Attack • Adversarial Robustness • +1

Robust Ensemble Model Training via Random Layer Sampling Against Adversarial Attack

no code implementations • 21 May 2020 • Hakmin Lee, Hong Joo Lee, Seong Tae Kim, Yong Man Ro

After the ensemble models are trained, the random layer sampling method can hide the gradients efficiently and avoid gradient-based attacks.

Adversarial Attack • Adversarial Robustness
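
A rough sketch of the layer-sampling idea, assuming residual-style blocks that can be skipped: each forward pass randomly drops some layers, so one set of shared weights behaves like an implicit ensemble. The skip probability and architecture are illustrative, not the paper's.

```python
import random
import torch

class RandomLayerSampling(torch.nn.Module):
    def __init__(self, blocks, keep_prob: float = 0.8):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)
        self.keep_prob = keep_prob

    def forward(self, x):
        for block in self.blocks:
            # During training each residual block survives with prob keep_prob;
            # at test time all blocks are kept.
            if (not self.training) or random.random() < self.keep_prob:
                x = x + block(x)
        return x

blocks = [torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU()) for _ in range(6)]
net = RandomLayerSampling(blocks)
net.train()
print(net(torch.randn(4, 32)).shape)  # torch.Size([4, 32])
```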

Revisiting Role of Autoencoders in Adversarial Settings

no code implementations • 21 May 2020 • Byeong Cheon Kim, Jung Uk Kim, Hakmin Lee, Yong Man Ro

Through the comprehensive experimental results and analysis, this paper presents the inherent property of adversarial robustness in the autoencoders.

Adversarial Defense • Adversarial Robustness • +1

Efficient Ensemble Model Generation for Uncertainty Estimation with Bayesian Approximation in Segmentation

no code implementations • 21 May 2020 • Hong Joo Lee, Seong Tae Kim, Hakmin Lee, Nassir Navab, Yong Man Ro

Experimental results show that the proposed method could provide useful uncertainty information by Bayesian approximation with the efficient ensemble model generation and improve the predictive performance.

Segmentation
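
In the same spirit (though not the paper's specific ensemble-generation scheme), Bayesian approximation of uncertainty is often sketched with stochastic forward passes: keep dropout active at inference, average the predictions, and read the variance as an uncertainty map.

```python
import torch

@torch.no_grad()
def mc_uncertainty(model, x, passes: int = 10):
    """Mean prediction and per-pixel variance from stochastic forward passes."""
    model.train()  # keep Dropout layers stochastic at inference
    preds = torch.stack([model(x).softmax(dim=1) for _ in range(passes)])
    return preds.mean(dim=0), preds.var(dim=0)

seg_net = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Dropout2d(0.5), torch.nn.Conv2d(16, 2, 1),
)
mean_pred, uncertainty = mc_uncertainty(seg_net, torch.randn(1, 3, 64, 64))
print(mean_pred.shape, uncertainty.shape)  # (1, 2, 64, 64) each
```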

Generative Guiding Block: Synthesizing Realistic Looking Variants Capable of Even Large Change Demands

no code implementations • 2 Jul 2019 • Minho Park, Hak Gu Kim, Yong Man Ro

Generating realistic-looking images with large variations (e.g., large spatial deformations and large pose changes), however, is very challenging.

Image Generation

Generation of Multimodal Justification Using Visual Word Constraint Model for Explainable Computer-Aided Diagnosis

no code implementations • 10 Jun 2019 • Hyebin Lee, Seong Tae Kim, Yong Man Ro

The ambiguity of the decision-making process has been pointed out as the main obstacle to applying deep learning-based methods in practice, in spite of their outstanding performance.

Decision Making • Sentence

Feature2Mass: Visual Feature Processing in Latent Space for Realistic Labeled Mass Generation

no code implementations • 17 Sep 2018 • Jae-Hyeok Lee, Seong Tae Kim, Hakmin Lee, Yong Man Ro

In order for a deep network model to be well-behaved in bio-image computing fields, a large amount of labeled data is required.

Image Generation

ICADx: Interpretable computer aided diagnosis of breast masses

no code implementations • 23 May 2018 • Seong Tae Kim, Hakmin Lee, Hak Gu Kim, Yong Man Ro

In this paper, we investigate interpretability in CADx with the proposed interpretable CADx (ICADx) framework.

Generative Adversarial Network

STAN: Spatio-Temporal Adversarial Networks for Abnormal Event Detection

no code implementations • 23 Apr 2018 • Sangmin Lee, Hak Gu Kim, Yong Man Ro

In this paper, we propose a novel abnormal event detection method with spatio-temporal adversarial networks (STAN).

Anomaly Detection • Event Detection

VR IQA NET: Deep Virtual Reality Image Quality Assessment using Adversarial Learning

no code implementations • 11 Apr 2018 • Heoun-taek Lim, Hak Gu Kim, Yong Man Ro

The proposed human perception guider criticizes the predicted quality score of the predictor with the human perceptual score using adversarial learning.

Image Quality Assessment • Position

Facial Dynamics Interpreter Network: What are the Important Relations between Local Dynamics for Facial Trait Estimation?

no code implementations • ECCV 2018 • Seong Tae Kim, Yong Man Ro

In this paper, a novel deep learning approach, named facial dynamics interpreter network, has been proposed to interpret the important relations between local dynamics for estimating facial traits from expression sequence.

Age Estimation • Gender Classification • +1

Learning Spatio-temporal Features with Partial Expression Sequences for on-the-Fly Prediction

no code implementations • 29 Nov 2017 • Wissam J. Baddar, Yong Man Ro

At test time, most spatio-temporal encoding methods assume that a temporally segmented sequence is fed to the learned model, which could require the prediction to wait until the full sequence is available, or require an auxiliary task that performs the temporal segmentation.

Modality-bridge Transfer Learning for Medical Image Classification

no code implementations • 10 Aug 2017 • Hak Gu Kim, Yeoreum Choi, Yong Man Ro

This paper presents a new approach of transfer learning-based medical image classification to mitigate insufficient labeled data problem in medical domain.

General Classification • Image Classification • +2

Convolution with Logarithmic Filter Groups for Efficient Shallow CNN

no code implementations • 31 Jul 2017 • Tae Kwan Lee, Wissam J. Baddar, Seong Tae Kim, Yong Man Ro

Our classification results on the Multi-PIE dataset for facial expression recognition and the CIFAR-10 dataset for object classification reveal that the compact CNN with the proposed logarithmic filter grouping scheme outperforms the same network with uniform filter grouping in terms of accuracy and parameter efficiency.

Classification • Facial Expression Recognition • +2

EvaluationNet: Can Human Skill be Evaluated by Deep Networks?

no code implementations • 31 May 2017 • Seong Tae Kim, Yong Man Ro

To improve the effectiveness of learning with instructional videos, observation and evaluation of the activity are required.
