no code implementations • 18 Mar 2024 • SooHwan Eom, Eunseop Yoon, Hee Suk Yoon, Chanwoo Kim, Mark Hasegawa-Johnson, Chang D. Yoo
In Automatic Speech Recognition (ASR) systems, a recurring obstacle is the generation of narrowly focused output distributions.
Automatic Speech Recognition (ASR) +1
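A "narrowly focused" (overconfident) output distribution can be quantified by its Shannon entropy; the sketch below, using made-up posteriors rather than anything from the paper, shows how a peaked distribution scores far lower entropy than a smoother one:

```python
import math

def entropy(p):
    """Shannon entropy in nats; lower entropy means a more
    narrowly focused (overconfident) distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

peaked = [0.97, 0.01, 0.01, 0.01]   # hypothetical overconfident ASR posterior
smooth = [0.40, 0.30, 0.20, 0.10]   # hypothetical better-calibrated posterior
print(entropy(peaked) < entropy(smooth))  # prints True
```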
1 code implementation • 29 Jan 2024 • Ian Covert, Chanwoo Kim, Su-In Lee, James Zou, Tatsunori Hashimoto
Many tasks in explainable machine learning, such as data valuation and feature attribution, perform expensive computation for each data point and can be intractable for large datasets.
no code implementations • 19 Jan 2024 • Abhinav Garg, Jiyeon Kim, Sushil Khyalia, Chanwoo Kim, Dhananjaya Gowda
Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system.
no code implementations • 14 Dec 2023 • Junsu Kim, Sumin Hong, Chanwoo Kim, Jihyeon Kim, Yihalem Yimolal Tiruneh, Jeongwan On, Jihyun Song, Sunhwa Choi, Seungryul Baek
In this paper, we introduce an effective buffer training strategy (eBTS) that creates an optimized replay buffer for object detection.
no code implementations • 5 Oct 2023 • Jae-Sung Bae, Joun Yeop Lee, Ji-Hyun Lee, Seongkyu Mun, Taehwa Kang, Hoon-Young Cho, Chanwoo Kim
Previous works in zero-shot text-to-speech (ZS-TTS) have attempted to enhance its systems by enlarging the training data through crowd-sourcing or augmenting existing speech data.
no code implementations • 16 Aug 2023 • Eunseop Yoon, Hee Suk Yoon, Dhananjaya Gowda, SooHwan Eom, Daehyeok Kim, John Harvill, Heting Gao, Mark Hasegawa-Johnson, Chanwoo Kim, Chang D. Yoo
Text-to-Text Transfer Transformer (T5) has recently been considered for the Grapheme-to-Phoneme (G2P) transduction.
1 code implementation • CVPR 2023 • Hoseong Cho, Chanwoo Kim, Jihyeon Kim, Seongyeong Lee, Elkhan Ismayilzada, Seungryul Baek
In our framework, we take the whole image depicting two hands, an object, and their interactions as input and jointly estimate three kinds of information from each frame: the poses of the two hands, the pose of the object, and the object type.
no code implementations • 29 Dec 2022 • Chanwoo Kim, Sathish Indurti, Jinhwan Park, Wonyong Sung
In our work, we define a macro-block that contains a large number of units from the input to a Recurrent Neural Network (RNN).
no code implementations • 6 Nov 2022 • Jihwan Lee, Jae-Sung Bae, Seongkyu Mun, Heejin Choi, Joun Yeop Lee, Hoon-Young Cho, Chanwoo Kim
With the recent developments in cross-lingual Text-to-Speech (TTS) systems, L2 (second-language, or foreign) accent problems arise.
no code implementations • 20 Oct 2022 • Hoseong Cho, Donguk Kim, Chanwoo Kim, Seongyeong Lee, Seungryul Baek
In this challenge, we aim to estimate global 3D hand poses from an input image in which two hands and an object are interacting, seen from an egocentric viewpoint.
1 code implementation • 30 Sep 2022 • Chris Lin, Hugh Chen, Chanwoo Kim, Su-In Lee
To address this, we propose contrastive corpus similarity, a novel and semantically meaningful scalar explanation output based on a reference corpus and a contrasting foil set of samples.
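As an illustrative reading of this definition (the cosine-similarity choice, the simple averaging, and every name below are assumptions, not the paper's exact formulation), the score is the explicand's average similarity to the reference corpus minus its average similarity to the foil set:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_corpus_similarity(z, corpus, foil):
    """Average similarity to the reference corpus minus average
    similarity to the contrasting foil set (illustrative definition)."""
    s_corpus = np.mean([cosine(z, c) for c in corpus])
    s_foil = np.mean([cosine(z, f) for f in foil])
    return s_corpus - s_foil

rng = np.random.default_rng(0)
z = rng.normal(size=8)                                     # explicand representation
corpus = [z + 0.1 * rng.normal(size=8) for _ in range(5)]  # samples near the explicand
foil = [rng.normal(size=8) for _ in range(5)]              # unrelated contrast samples
print(contrastive_corpus_similarity(z, corpus, foil) > 0)
```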
2 code implementations • 10 Jun 2022 • Ian Covert, Chanwoo Kim, Su-In Lee
Transformers have become a default architecture in computer vision, but understanding what drives their predictions remains a challenging problem.
no code implementations • 4 Apr 2022 • Jihwan Lee, Joun Yeop Lee, Heejin Choi, Seongkyu Mun, Sangjun Park, Jae-Sung Bae, Chanwoo Kim
Two proposed modules are added to the end-to-end TTS framework: an intonation predictor and an intonation encoder.
no code implementations • 8 Jan 2022 • Nauman Dawalatabad, Tushar Vatsal, Ashutosh Gupta, Sungsoo Kim, Shatrughan Singh, Dhananjaya Gowda, Chanwoo Kim
With the use of popular transducer-based models, it has become possible to practically deploy streaming speech recognition models on small devices [1].
no code implementations • 19 Nov 2021 • Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim
However, we observe that training of MoChA models seems to be more sensitive to various factors such as the characteristics of the training sets and the incorporation of additional augmentation techniques.
no code implementations • 19 Nov 2021 • Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim
To improve the accuracy of a low-resource Italian ASR, we leverage a well-trained English model, an unlabeled text corpus, and an unlabeled audio corpus using transfer learning, TTS augmentation, and SSL, respectively.
no code implementations • 13 Oct 2021 • Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim, Chanwoo Kim
Hence, the read/write decision policy remains the same across different input modalities, i.e., speech and text.
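One concrete read/write policy that is modality-agnostic in exactly this sense is the well-known wait-k policy (used here purely for illustration; it is not claimed to be this paper's policy), since its decisions depend only on token counts, not on whether the source is speech or text:

```python
def wait_k_policy(k, num_source_tokens, num_target_tokens):
    """Illustrative wait-k policy: READ the first k source tokens,
    then alternate WRITE/READ until the source is exhausted, then
    WRITE the remaining target tokens."""
    actions = []
    read, written = 0, 0
    while written < num_target_tokens:
        if read < min(k + written, num_source_tokens):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

# With k=2 and 4 source/4 target tokens, writing lags reading by two steps.
print(wait_k_policy(2, 4, 4))
```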
no code implementations • ICASSP 2021 • Sathish Indurthi, Mohd Abbas Zaidi, Nikhil Kumar Lakumarapu, Beomseok Lee, Hyojung Han, Seokchan Ahn, Sangha Kim, Chanwoo Kim, Inchul Hwang
In general, direct Speech-to-Text translation (ST) is jointly trained with the Automatic Speech Recognition (ASR) and Machine Translation (MT) tasks.
Ranked #1 on Speech-to-Text Translation on MuST-C EN->DE (using extra training data)
Automatic Speech Recognition (ASR) +5
no code implementations • 4 May 2021 • Chanwoo Kim, Abhinav Garg, Dhananjaya Gowda, Seongkyu Mun, Changwoo Han
In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoChA) jointly trained with enhancement layers.
no code implementations • 29 Dec 2020 • Hyojung Han, Sathish Indurthi, Mohd Abbas Zaidi, Nikhil Kumar Lakumarapu, Beomseok Lee, Sangha Kim, Chanwoo Kim, Inchul Hwang
The current re-translation approaches are based on autoregressive sequence generation models (ReTA), which generate target tokens in the (partial) translation sequentially.
no code implementations • 14 Dec 2020 • Chanwoo Kim, Dhananjaya Gowda, Dongsoo Lee, Jiyeon Kim, Ankur Kumar, Sungsoo Kim, Abhinav Garg, Changwoo Han
Conventional speech recognition systems comprise a large number of discrete components such as an acoustic model, a language model, a pronunciation model, a text-normalizer, an inverse-text normalizer, a decoder based on a Weighted Finite State Transducer (WFST), and so on.
Automatic Speech Recognition (ASR) +3
no code implementations • 15 Feb 2020 • Chanwoo Kim, Kwangyoun Kim, Sathish Reddy Indurthi
More specifically, a time-frequency bin is masked if the filterbank energy in this bin is less than a certain energy threshold.
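The masking rule just described can be sketched as follows (the -20 dB threshold relative to the peak and the toy filterbank values are assumptions for illustration, not the paper's settings):

```python
import numpy as np

def mask_low_energy_bins(filterbank, threshold_db=-20.0):
    """Zero out time-frequency bins whose filterbank energy falls
    below a threshold relative to the peak energy (illustrative
    threshold choice)."""
    log_energy = 10.0 * np.log10(filterbank + 1e-10)
    keep = log_energy >= (log_energy.max() + threshold_db)
    return filterbank * keep

# Toy 2x2 filterbank: the two tiny bins fall below the threshold.
spec = np.array([[1.0, 0.001],
                 [0.5, 0.0001]])
masked = mask_low_energy_bins(spec)
print(masked)
```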
no code implementations • 2 Jan 2020 • Kwangyoun Kim, Kyungmin Lee, Dhananjaya Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim
In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (>10K hours) corpus.
Automatic Speech Recognition (ASR) +3
no code implementations • 28 Dec 2019 • Abhinav Garg, Dhananjaya Gowda, Ankur Kumar, Kwangyoun Kim, Mehul Kumar, Chanwoo Kim
In this paper, we propose a refined multi-stage multi-task training strategy to improve the performance of online attention-based encoder-decoder (AED) models.
no code implementations • 22 Dec 2019 • Chanwoo Kim, Sungsoo Kim, Kwangyoun Kim, Mehul Kumar, Jiyeon Kim, Kyungmin Lee, Changwoo Han, Abhinav Garg, Eunhyang Kim, Minkyoo Shin, Shatrughan Singh, Larry Heck, Dhananjaya Gowda
Our end-to-end speech recognition system built using this training infrastructure achieved a 2.44% WER on the LibriSpeech test-clean set after applying shallow fusion with a Transformer language model (LM).
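Shallow fusion, as referenced in this result, combines the end-to-end model's score with the external LM's score log-linearly at decoding time. A minimal sketch of that scoring step (the 0.3 interpolation weight and the toy probabilities are assumptions, not the paper's values):

```python
import math

def shallow_fusion(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Log-linear interpolation used in shallow fusion:
    score(token) = log P_asr(token) + lambda * log P_lm(token).
    Tokens missing from the LM get a small floor probability."""
    return {tok: lp + lm_weight * lm_log_probs.get(tok, math.log(1e-9))
            for tok, lp in asr_log_probs.items()}

asr = {"cat": math.log(0.6), "cap": math.log(0.4)}  # acoustically ambiguous pair
lm = {"cat": math.log(0.9), "cap": math.log(0.1)}   # LM strongly prefers "cat"
fused = shallow_fusion(asr, lm)
best = max(fused, key=fused.get)
print(best)  # prints "cat"
```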
no code implementations • 22 Dec 2019 • Chanwoo Kim, Mehul Kumar, Kwangyoun Kim, Dhananjaya Gowda
With the power-function-based MUD, we apply a power-function nonlinearity whose coefficients are chosen to maximize the likelihood under the assumption that the nonlinearity outputs follow a uniform distribution.
Automatic Speech Recognition (ASR) +1
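A rough sketch of the underlying idea that a power function y = a * x**b can map inputs toward a uniform distribution (the least-squares fit to the empirical CDF below is an illustrative stand-in for the paper's maximum-likelihood estimation, and the synthetic data are chosen so a power law fits exactly):

```python
import numpy as np

def power_nonlinearity(x, a, b):
    """Power-function nonlinearity y = a * x**b."""
    return a * np.power(x, b)

# Draw samples whose true CDF is exactly x**2 on [0, 1], so a power
# function can uniformize them; fit (a, b) so the outputs track the
# empirical CDF (a least-squares proxy for the likelihood criterion).
rng = np.random.default_rng(0)
x = np.sort(np.sqrt(rng.uniform(size=1000)))
target = (np.arange(1, x.size + 1) - 0.5) / x.size  # empirical CDF values
b, log_a = np.polyfit(np.log(x), np.log(target), 1)
y = power_nonlinearity(x, np.exp(log_a), b)
print(abs(y.mean() - 0.5) < 0.05)  # near-uniform outputs average about 0.5
```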
no code implementations • 11 Nov 2019 • Sathish Indurthi, Houjeung Han, Nikhil Kumar Lakumarapu, Beomseok Lee, Insoo Chung, Sangha Kim, Chanwoo Kim
In the meta-learning phase, the parameters of the model are exposed to vast amounts of speech transcripts (e.g., English ASR) and text translations (e.g., English-German MT).
Automatic Speech Recognition (ASR) +6