Search Results for author: Minsu Kim

Found 63 papers, 29 papers with code

Epistemology of Language Models: Do Language Models Have Holistic Knowledge?

no code implementations19 Mar 2024 Minsu Kim, James Thorne

This paper investigates the inherent knowledge in language models from the perspective of epistemological holism.

Accelerating String-Key Learned Index Structures via Memoization-based Incremental Training

no code implementations18 Mar 2024 Minsu Kim, Jinwoo Hwang, Guseul Heo, Seiyeon Cho, Divya Mahajan, Jongse Park

Learned indexes use machine learning models to learn the mappings between keys and their corresponding positions in key-value indexes.
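
As a rough illustration of the learned-index idea (a minimal sketch assuming a plain linear fit; the paper's memoization-based incremental training is not reproduced here): the model predicts an approximate position for a key, and a bounded local search around that prediction recovers the exact slot.

```python
# Minimal learned-index sketch: a linear model approximates key -> position,
# and a search window bounded by the model's worst-case error finds the exact slot.
import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.choice(1_000_000, size=10_000, replace=False))
positions = np.arange(len(keys))

slope, intercept = np.polyfit(keys, positions, deg=1)   # stand-in for a learned model
max_err = int(np.ceil(np.max(np.abs(slope * keys + intercept - positions)))) + 1

def lookup(key: int) -> int:
    pred = int(slope * key + intercept)                  # predicted position
    lo = max(pred - max_err, 0)                          # bounded search window
    hi = min(pred + max_err + 1, len(keys))
    return lo + int(np.searchsorted(keys[lo:hi], key))   # exact position via local search

assert lookup(int(keys[1234])) == 1234
```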

Ant Colony Sampling with GFlowNets for Combinatorial Optimization

2 code implementations11 Mar 2024 Minsu Kim, Sanghyeok Choi, Jiwoo Son, Hyeonah Kim, Jinkyoo Park, Yoshua Bengio

This paper introduces the Generative Flow Ant Colony Sampler (GFACS), a novel neural-guided meta-heuristic algorithm for combinatorial optimization.

Combinatorial Optimization

Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

1 code implementation23 Feb 2024 Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro

In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements.

Ranked #4 on Lipreading on LRS3-TED (using extra training data)

Lipreading Lip Reading +3

Genetic-guided GFlowNets: Advancing in Practical Molecular Optimization Benchmark

no code implementations5 Feb 2024 Hyeonah Kim, Minsu Kim, Sanghyeok Choi, Jinkyoo Park

This paper proposes a novel variant of GFlowNet, genetic-guided GFlowNet (Genetic GFN), which integrates an iterative genetic search into GFlowNet.

Bayesian Optimization

Multilingual Visual Speech Recognition with a Single Model by Learning with Discrete Visual Speech Units

no code implementations18 Jan 2024 Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Se Jin Park, Yong Man Ro

By using the visual speech units as the inputs of our system, we pre-train the model to predict corresponding text outputs on massive multilingual data constructed by merging several VSR databases.

Sentence speech-recognition +1

Quilt: Robust Data Segment Selection against Concept Drifts

no code implementations15 Dec 2023 Minsu Kim, Seong-Hyeon Hwang, Steven Euijong Whang

However, we contend that explicitly utilizing the drifted data together leads to much better model accuracy and propose Quilt, a data-centric framework for identifying and selecting data segments that maximize model accuracy.

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

1 code implementation5 Dec 2023 Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro

To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A.

Self-Supervised Learning Speech-to-Speech Translation +1

Learning Energy Decompositions for Partial Inference of GFlowNets

no code implementations5 Oct 2023 Hyosoon Jang, Minsu Kim, Sungsoo Ahn

In particular, we focus on improving GFlowNet with partial inference: training flow functions with the evaluation of the intermediate states or transitions.

Local Search GFlowNets

2 code implementations4 Oct 2023 Minsu Kim, Taeyoung Yun, Emmanuel Bengio, Dinghuai Zhang, Yoshua Bengio, Sungsoo Ahn, Jinkyoo Park

Generative Flow Networks (GFlowNets) are amortized sampling methods that learn a distribution over discrete objects proportional to their rewards.
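
For context, the defining property of a GFlowNet sampler and one standard training objective (trajectory balance) can be written as follows; this is generic GFlowNet background, not the local-search modification introduced in the paper:

$$P_\top(x) \;\propto\; R(x), \qquad
\mathcal{L}_{\mathrm{TB}}(\tau) \;=\; \Bigl(\log \frac{Z_\theta \prod_{t} P_F(s_{t+1}\mid s_t;\theta)}{R(x)\,\prod_{t} P_B(s_t\mid s_{t+1};\theta)}\Bigr)^{2},$$

where $\tau = (s_0 \to \cdots \to x)$ is a sampled trajectory, $P_F$ and $P_B$ are the forward and backward policies, and $Z_\theta$ estimates the partition function.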

Learning to Scale Logits for Temperature-Conditional GFlowNets

1 code implementation4 Oct 2023 Minsu Kim, Joohwan Ko, Taeyoung Yun, Dinghuai Zhang, Ling Pan, Woochang Kim, Jinkyoo Park, Emmanuel Bengio, Yoshua Bengio

We find that the challenge is greatly reduced if a learned function of the temperature is used to scale the policy's logits directly.
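
A minimal sketch of that idea (illustrative PyTorch with assumed module names, not the authors' implementation): a small learned function of the temperature produces a positive scalar that multiplies the policy's logits.

```python
# Illustrative sketch: a learned function g(T) outputs a positive scalar that
# directly scales the policy's logits, so one conditional policy can cover a
# range of temperatures.
import torch
import torch.nn as nn

class TemperatureScaledPolicy(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                    nn.Linear(128, n_actions))
        # g(T): learned positive multiplier for the logits (assumed form).
        self.g = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                               nn.Linear(32, 1), nn.Softplus())

    def forward(self, state: torch.Tensor, temperature: torch.Tensor) -> torch.Tensor:
        logits = self.policy(state)
        scale = self.g(temperature.unsqueeze(-1))  # shape (batch, 1)
        return logits * scale                      # temperature-scaled logits

policy = TemperatureScaledPolicy(state_dim=16, n_actions=8)
probs = torch.softmax(policy(torch.randn(4, 16), torch.tensor([0.5, 1.0, 2.0, 4.0])), dim=-1)
```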

BroadBEV: Collaborative LiDAR-camera Fusion for Broad-sighted Bird's Eye View Map Construction

no code implementations20 Sep 2023 Minsu Kim, Giseop Kim, Kyong Hwan Jin, Sunwook Choi

The method boosts the camera branch's depth estimation learning and induces accurate localization of dense camera features in BEV space.

Depth Estimation Sensor Fusion

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

no code implementations15 Sep 2023 Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

To this end, we start by importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp.

Image Comprehension Language Modelling +1

Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

no code implementations15 Sep 2023 Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

Unlike previous methods that tried to improve VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for different languages without human intervention.

Language Identification speech-recognition +1

Learning Residual Elastic Warps for Image Stitching under Dirichlet Boundary Condition

1 code implementation4 Sep 2023 Minsu Kim, Yongjun Lee, Woo Kyoung Han, Kyong Hwan Jin

Recent proposals for learning-based elastic warps enable deep image stitching to align images exposed to large parallax errors.

Image Inpainting Image Stitching

Implicit Neural Image Stitching

1 code implementation4 Sep 2023 Minsu Kim, Jaewon Lee, Byeonghun Lee, Sunghoon Im, Kyong Hwan Jin

Existing frameworks for image stitching often produce visually reasonable stitching results.

Image Stitching Super-Resolution

DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion

no code implementations23 Aug 2023 Se Jin Park, Joanna Hong, Minsu Kim, Yong Man Ro

We contribute a new large-scale 3D facial mesh dataset, 3D-HDTF to enable the synthesis of variations in identities, poses, and facial motions of 3D face mesh.

3D Face Animation

Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

no code implementations ICCV 2023 Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro

In order to mitigate the challenge, we try to learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units.

Lip Reading

Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation

1 code implementation3 Aug 2023 Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro

A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST).

Representation Learning Speech-to-Speech Translation +4

Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models

no code implementations28 Jun 2023 Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro

The visual speaker embedding is derived from a single target face image and enables improved mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio.

Face Generation

Equity-Transformer: Solving NP-hard Min-Max Routing Problems as Sequential Generation with Equity Context

1 code implementation5 Jun 2023 Jiwoo Son, Minsu Kim, Sanghyeok Choi, Hyeonah Kim, Jinkyoo Park

Notably, our method achieves significant reductions in runtime (approximately 335 times) and cost values (about 53%) compared to a competitive heuristic (LKH3) on mTSP instances with 100 vehicles and 1,000 cities.

Decision Making Traveling Salesman Problem

Bootstrapped Training of Score-Conditioned Generator for Offline Design of Biological Sequences

1 code implementation NeurIPS 2023 Minsu Kim, Federico Berto, Sungsoo Ahn, Jinkyoo Park

The subsequent stage involves bootstrapping, which augments the training dataset with self-generated data labeled by a proxy score function.
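
A rough sketch of that bootstrapping loop (hypothetical `generator` and `proxy_score` interfaces; only the general idea of proxy-labeled data augmentation is shown, not the paper's exact procedure):

```python
# Generic bootstrapping loop (illustrative only): sample candidates from the
# score-conditioned generator, label them with a proxy score function, keep the
# top candidates, and fine-tune the generator on the augmented dataset.
def bootstrap(generator, proxy_score, dataset, rounds=5, n_samples=1024, top_k=128):
    for _ in range(rounds):
        candidates = generator.sample(n_samples)                     # self-generated sequences
        ranked = sorted(candidates, key=proxy_score, reverse=True)   # proxy-based ranking
        dataset.extend((x, proxy_score(x)) for x in ranked[:top_k])  # augment training data
        generator.fit(dataset)                                       # retrain / fine-tune
    return generator
```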

Intelligible Lip-to-Speech Synthesis with Speech Units

1 code implementation31 May 2023 Jeongsoo Choi, Minsu Kim, Yong Man Ro

Therefore, the proposed L2S model is trained to generate multiple targets, mel-spectrogram and speech units.

Lip to Speech Synthesis Speech Synthesis

Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation

no code implementations31 May 2023 Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro

The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion.

Talking Face Generation

PartMix: Regularization Strategy to Learn Part Discovery for Visible-Infrared Person Re-identification

no code implementations CVPR 2023 Minsu Kim, Seungryong Kim, Jungin Park, Seongheon Park, Kwanghoon Sohn

Modern data augmentation using mixture-based techniques can regularize models against overfitting to the training data in various computer vision applications, but a proper data augmentation technique tailored to part-based Visible-Infrared person Re-IDentification (VI-ReID) models remains unexplored.

Contrastive Learning Data Augmentation +1

Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring

1 code implementation CVPR 2023 Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro

Thus, we first show that previous AVSR models are in fact not robust to corruption of the multimodal input streams, the audio and the visual inputs, compared to uni-modal models.

Audio-Visual Speech Recognition speech-recognition +1

Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video

no code implementations27 Feb 2023 Minsu Kim, Chae Won Kim, Yong Man Ro

The proposed DVFA can align the input transcription (i.e., sentence) with the talking face video without accessing the speech audio.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Lip-to-Speech Synthesis in the Wild with Multi-task Learning

3 code implementations17 Feb 2023 Minsu Kim, Joanna Hong, Yong Man Ro

To this end, we design multi-task learning that guides the model using multimodal supervision, i.e., text and audio, to complement the insufficient word representations of acoustic feature reconstruction loss.

Lip to Speech Synthesis Multi-Task Learning +1

Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition

no code implementations16 Feb 2023 Minsu Kim, Hyung-Il Kim, Yong Man Ro

As it relies on visual information to model speech, its performance is inherently sensitive to personal lip appearance and movements, which makes VSR models show degraded performance when applied to unseen speakers.

Sentence speech-recognition +1

SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

no code implementations2 Nov 2022 Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro

It stores lip motion features from sequential ground truth images in the value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at inference time.

Audio-Visual Synchronization Representation Learning +1

Meta Input: How to Leverage Off-the-Shelf Deep Neural Networks

no code implementations21 Oct 2022 Minsu Kim, Youngjoon Yu, Sungjune Park, Yong Man Ro

The proposed meta input can be optimized with only a small amount of test data by considering the relation between the test input data and its output prediction.

Speaker-adaptive Lip Reading with User-dependent Padding

1 code implementation9 Aug 2022 Minsu Kim, Hyunjun Kim, Yong Man Ro

In this paper, to remedy the performance degradation of lip reading model on unseen speakers, we propose a speaker-adaptive lip reading method, namely user-dependent padding.

Lip Reading speech-recognition +1

Green, Quantized Federated Learning over Wireless Networks: An Energy-Efficient Design

no code implementations19 Jul 2022 Minsu Kim, Walid Saad, Mohammad Mozaffari, Merouane Debbah

In this paper, a green-quantized FL framework, which represents data with a finite precision level in both local training and uplink transmission, is proposed.

Federated Learning Quantization
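
As background on what a finite precision level means here (a generic stochastic quantizer, not necessarily the exact scheme used in the paper): with $n$ bits, a value $w \in [w_{\min}, w_{\max}]$ is mapped to one of $2^n$ levels spaced by $\delta = (w_{\max} - w_{\min})/(2^n - 1)$, with randomized rounding so the quantizer is unbiased:

$$Q(w) = w_{\min} + \delta\bigl(\lfloor k \rfloor + b\bigr), \quad k = \frac{w - w_{\min}}{\delta}, \quad b \sim \mathrm{Bernoulli}\bigl(k - \lfloor k \rfloor\bigr), \qquad \mathbb{E}[Q(w)] = w.$$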

Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition

1 code implementation13 Jul 2022 Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro

The enhanced audio features are fused with the visual features and taken to an encoder-decoder model composed of Conformer and Transformer for speech recognition.

Audio-Visual Speech Recognition Noisy Speech Recognition +2

CoVA: Exploiting Compressed-Domain Analysis to Accelerate Video Analytics

1 code implementation2 Jul 2022 Jinwoo Hwang, Minsu Kim, Daeun Kim, Seungho Nam, Yoonsung Kim, Dohee Kim, Hardik Sharma, Jongse Park

This paper presents CoVA, a novel cascade architecture that splits the cascade computation between compressed domain and pixel domain to address the decoding bottleneck, supporting both temporal and spatial queries.

VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection

no code implementations15 Jun 2022 Joanna Hong, Minsu Kim, Yong Man Ro

Thus, the proposed framework has the advantage of synthesizing speech with the correct content even from the silent talking face video of an unseen subject.

feature selection Speech Synthesis

DevFormer: A Symmetric Transformer for Context-Aware Device Placement

2 code implementations26 May 2022 Haeyeon Kim, Minsu Kim, Federico Berto, Joungho Kim, Jinkyoo Park

In this paper, we present DevFormer, a novel transformer-based architecture for addressing the complex and computationally demanding problem of hardware design optimization.

Combinatorial Optimization Meta-Learning

Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization

1 code implementation26 May 2022 Minsu Kim, Junyoung Park, Jinkyoo Park

Deep reinforcement learning (DRL)-based combinatorial optimization (CO) methods (i.e., DRL-NCO) have shown significant merit over conventional CO solvers, as DRL-NCO can learn CO solvers while relying less on problem-specific expert domain knowledge (heuristic methods) and supervised labeled data (supervised learning methods).

Combinatorial Optimization Traveling Salesman Problem

Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

1 code implementation The AAAI Conference on Artificial Intelligence (AAAI) 2022 Minsu Kim, Jeong Hun Yeo, Yong Man Ro

With the multi-head key memories, MVM extracts possible candidate audio features from the memory, which allows the lip reading model to consider which pronunciations the input lip movement can represent.

Lip Reading

Lip to Speech Synthesis with Visual Context Attentional GAN

1 code implementation NeurIPS 2021 Minsu Kim, Joanna Hong, Yong Man Ro

In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis.

Contrastive Learning Generative Adversarial Network +2

Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video

1 code implementation ICCV 2021 Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro

By learning the interrelationship through the associative bridge, the proposed bridging framework can obtain the target modal representations inside the memory network even with only the source modal input, and it provides rich information for its downstream tasks.

Lip Reading

Speech Reconstruction with Reminiscent Sound via Visual Voice Memory

1 code implementation IEEE/ACM Transactions on Audio, Speech, and Language Processing 2021 Joanna Hong, Minsu Kim, Se Jin Park, Yong Man Ro

Our key contributions are: (1) proposing the Visual Voice memory, which provides rich audio information that complements the visual features, thus producing high-quality speech from silent video, and (2) enabling multi-speaker and unseen speaker training by memorizing auditory features and the corresponding visual features.

Speaker-Specific Lip to Speech Synthesis

On the Tradeoff between Energy, Precision, and Accuracy in Federated Quantized Neural Networks

no code implementations15 Nov 2021 Minsu Kim, Walid Saad, Mohammad Mozaffari, Merouane Debbah

In this paper, a quantized FL framework, that represents data with a finite level of precision in both local training and uplink transmission, is proposed.

Federated Learning Quantization

Learning Collaborative Policies to Solve NP-hard Routing Problems

1 code implementation NeurIPS 2021 Minsu Kim, Jinkyoo Park, Joungho Kim

Recently, deep reinforcement learning (DRL) frameworks have shown potential for solving NP-hard routing problems such as the traveling salesman problem (TSP) without problem-specific expert knowledge.

Traveling Salesman Problem

Learning Canonical 3D Object Representation for Fine-Grained Recognition

no code implementations ICCV 2021 Sunghun Joung, Seungryong Kim, Minsu Kim, Ig-Jae Kim, Kwanghoon Sohn

By incorporating 3D shape and appearance jointly in a deep representation, our method learns the discriminative representation of the object and achieves competitive performance on fine-grained image recognition and vehicle re-identification.

3D Shape Reconstruction Fine-Grained Image Recognition +3

Precoding Design for Multi-user MIMO Systems with Delay-Constrained and -Tolerant Users

no code implementations17 Jun 2021 Minsu Kim, Jeonghun Park, Jemin Lee

We consider an optimization problem that maximizes the sum spectral efficiency of delay-tolerant users while satisfying the latency constraint of delay-constrained users, and propose a generalized power iteration (GPI) precoding algorithm that finds a principal precoding vector.

Non-Terrestrial Networks for UAVs: Base Station Service Provisioning Schemes with Antenna Tilt

no code implementations14 Apr 2021 Seongjun Kim, Minsu Kim, Jong Yeol Ryu, Jemin Lee, Tony Q. S. Quek

By considering the antenna tilt angle-based channel gain, we derive the network outage probability for both IS-BS and ES-BS schemes, and show the existence of the optimal tilt angle that minimizes the network outage probability after analyzing the conflicting impacts of the antenna tilt angle.

Securing Communications with Friendly Unmanned Aerial Vehicle Jammers

no code implementations17 Dec 2020 Minsu Kim, Seongjun Kim, Jemin Lee

In this paper, we analyze the impact of a friendly unmanned aerial vehicle (UAV) jammer on UAV communications in the presence of multiple eavesdroppers.

Cross-Domain Grouping and Alignment for Domain Adaptive Semantic Segmentation

1 code implementation15 Dec 2020 Minsu Kim, Sunghun Joung, Seungryong Kim, Jungin Park, Ig-Jae Kim, Kwanghoon Sohn

Existing techniques to adapt semantic segmentation networks across the source and target domains within deep convolutional neural networks (CNNs) deal with all the samples from the two domains in a global or category-aware manner.

Clustering Domain Adaptation +2

Ensuring Data Freshness for Blockchain-enabled Monitoring Networks

no code implementations12 Nov 2020 Minsu Kim, Sungho Lee, Chanwon Park, Jemin Lee, Walid Saad

The age of information (AoI) is a recently proposed metric for quantifying data freshness in real-time status monitoring systems where timeliness is of importance.
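
For reference, the standard definition of AoI (general background, not specific to this paper): at time $t$, the age is

$$\Delta(t) = t - u(t),$$

where $u(t)$ is the generation time of the most recently received status update; the time-average AoI over a window $[0, T]$ is $\bar{\Delta} = \tfrac{1}{T}\int_0^T \Delta(t)\,dt$.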

Age of Information Analysis in Hyperledger Fabric Blockchain-enabled Monitoring Networks

no code implementations28 Oct 2020 Minsu Kim, Sungho Lee, Chanwon Park, Jemin Lee

In this paper, we explore the data freshness in the Hyperledger Fabric Blockchain-enabled monitoring network (HeMN) by leveraging the AoI metric.

Cylindrical Convolutional Networks for Joint Object Detection and Viewpoint Estimation

no code implementations CVPR 2020 Sunghun Joung, Seungryong Kim, Hanjae Kim, Minsu Kim, Ig-Jae Kim, Junghyun Cho, Kwanghoon Sohn

To overcome this limitation, we introduce a learnable module, cylindrical convolutional networks (CCNs), that exploit a cylindrical representation of a convolutional kernel defined in 3D space.

Object object-detection +2
