2 code implementations • 18 May 2023 • Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe
Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance in various tasks, including automatic speech recognition (ASR), speech translation (ST) and spoken language understanding (SLU).
Automatic Speech Recognition (ASR)
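As a rough illustration of what "convolution-augmented Transformer" means here, below is a simplified PyTorch sketch of a single Conformer block (macaron feed-forward halves around self-attention and a convolution module). The layer sizes are assumptions, and plain multi-head attention stands in for the relative-positional attention used in practice.

```python
# Simplified Conformer block sketch (assumption: plain multi-head attention
# replaces the relative-positional attention used in the actual architecture).
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, d_model, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, x):                       # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)        # -> (batch, d_model, time)
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)
        return x + y                            # residual connection

class ConformerBlock(nn.Module):
    """Macaron structure: 1/2 FFN -> self-attention -> convolution -> 1/2 FFN."""
    def __init__(self, d_model=256, n_heads=4, ffn_dim=1024):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d_model),
                                  nn.Linear(d_model, ffn_dim), nn.SiLU(),
                                  nn.Linear(ffn_dim, d_model))
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.ffn2 = nn.Sequential(nn.LayerNorm(d_model),
                                  nn.Linear(d_model, ffn_dim), nn.SiLU(),
                                  nn.Linear(ffn_dim, d_model))
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = self.conv(x)
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)

frames = torch.randn(2, 100, 256)        # (batch, time, feature) dummy input
print(ConformerBlock()(frames).shape)    # torch.Size([2, 100, 256])
```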
no code implementations • 20 Dec 2022 • Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe
In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape.
no code implementations • 16 Dec 2022 • Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu, Shinji Watanabe
During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments.
Automatic Speech Recognition (ASR)
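A minimal sketch of the kind of auxiliary objective described above, under assumed tensor shapes and a hypothetical loss weight: the current segment's context embedding is pulled toward the (detached) context vectors of neighbouring segments via a cosine-similarity term added to the main fine-tuning loss.

```python
# Hypothetical sketch of an auxiliary similarity loss between a segment's
# context embedding and the context vectors of its neighbouring segments.
import torch
import torch.nn.functional as F

def auxiliary_context_loss(current_ctx, neighbor_ctxs):
    """current_ctx: (batch, dim); neighbor_ctxs: (batch, n_neighbors, dim)."""
    current = current_ctx.unsqueeze(1)                    # (batch, 1, dim)
    sim = F.cosine_similarity(current, neighbor_ctxs.detach(), dim=-1)
    return (1.0 - sim).mean()                             # encourage similarity

# Toy usage: combine with the main fine-tuning loss via a weight (assumption).
batch, n_neighbors, dim = 4, 2, 256
current_ctx = torch.randn(batch, dim, requires_grad=True)
neighbor_ctxs = torch.randn(batch, n_neighbors, dim)
main_loss = torch.tensor(1.23)        # placeholder for the main ASR loss
loss = main_loss + 0.1 * auxiliary_context_loss(current_ctx, neighbor_ctxs)
loss.backward()
```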
1 code implementation • NAACL 2022 • Ankita Pasad, Felix Wu, Suwon Shon, Karen Livescu, Kyu J. Han
In this work we focus on low-resource spoken named entity recognition (NER) and address the question: Beyond self-supervised pre-training, how can we use external speech and/or text data that are not annotated for the task?
1 code implementation • 19 Nov 2021 • Suwon Shon, Ankita Pasad, Felix Wu, Pablo Brusco, Yoav Artzi, Karen Livescu, Kyu J. Han
Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks.
Ranked #1 on Named Entity Recognition (NER) on SLUE
Automatic Speech Recognition (ASR)
no code implementations • 11 Jun 2021 • Suwon Shon, Pablo Brusco, Jing Pan, Kyu J. Han, Shinji Watanabe
In this paper, we explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
Automatic Speech Recognition (ASR)
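Purely as an illustration of the general idea (not the paper's pipeline), one can run a pre-trained text sentiment classifier over an ASR transcript; the model below is simply whatever default the transformers sentiment pipeline downloads.

```python
# Illustrative only: applying a pre-trained text sentiment model to an ASR
# hypothesis; the specific model is an assumption, not the paper's setup.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")   # downloads a default English model
asr_hypothesis = "i really enjoyed the flight the crew was wonderful"
print(sentiment(asr_hypothesis))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```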
no code implementations • 11 May 2019 • Achintya kr. Sarkar, Zheng-Hua Tan, Hao Tang, Suwon Shon, James Glass
There are a number of studies on extracting bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases, and triphone states, with the goal of improving the performance of text-dependent speaker verification (TD-SV).
Automatic Speech Recognition (ASR)
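A minimal sketch of bottleneck-feature extraction under an assumed architecture: the BN features are simply the activations of a narrow hidden layer in a DNN trained with a discriminative objective (here, speaker classification).

```python
# Sketch (assumed layer sizes): bottleneck features are the activations of a
# narrow hidden layer in a DNN trained to discriminate e.g. speakers.
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, feat_dim=40, bn_dim=64, n_classes=500):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, bn_dim),           # narrow bottleneck layer
        )
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(bn_dim, n_classes))

    def forward(self, x):
        bn = self.front(x)                    # bottleneck features
        return self.classifier(bn), bn

model = BottleneckDNN()
frames = torch.randn(10, 40)                  # 10 frames of 40-dim features
logits, bn_feats = model(frames)
print(bn_feats.shape)                         # torch.Size([10, 64])
```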
no code implementations • 7 Apr 2019 • Suwon Shon, Hao Tang, James Glass
In this paper, we propose VoiceID loss, a novel loss function for training a speech enhancement model to improve the robustness of speaker verification.
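A rough sketch of the idea, with assumed module shapes and names: a masking-based enhancement network is optimised so that a frozen, pre-trained speaker classifier still assigns the enhanced spectrogram to the correct speaker.

```python
# Sketch of the idea (shapes and modules are assumptions): a masking-based
# enhancement network is trained so that a frozen speaker classifier still
# recognises the correct speaker from the enhanced spectrogram.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_speakers, n_bins = 100, 257
enhancer = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU(),
                         nn.Linear(256, n_bins), nn.Sigmoid())   # mask in [0, 1]
speaker_net = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU(),
                            nn.Linear(256, n_speakers))          # pre-trained in practice
for p in speaker_net.parameters():
    p.requires_grad = False                   # keep the speaker model frozen

noisy = torch.rand(8, 200, n_bins)            # (batch, frames, freq) magnitudes
labels = torch.randint(0, n_speakers, (8,))

enhanced = enhancer(noisy) * noisy            # apply predicted mask
logits = speaker_net(enhanced).mean(dim=1)    # average frame logits per utterance
voiceid_loss = F.cross_entropy(logits, labels)
voiceid_loss.backward()                       # gradients flow only into the enhancer
```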
no code implementations • 4 Dec 2018 • Suwon Shon, Ahmed Ali, James Glass
An important issue for end-to-end systems is that they need some knowledge of the application domain, because they can be vulnerable to use cases not seen during training; such a scenario is often referred to as a domain-mismatched condition.
no code implementations • 27 Nov 2018 • Suwon Shon, Tae-Hyun Oh, James Glass
In this paper, we present a multi-modal online person verification system using both speech and visual signals.
no code implementations • 27 Nov 2018 • Suwon Shon, Young-Gun Lee, Taesu Kim
In this paper, we propose Random Speaker-variability Subspace (RSS) projection to map data into LSH-based hash tables.
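For illustration, here is a generic random-projection LSH sketch; in RSS the projection directions would be drawn from a speaker-variability subspace rather than sampled at random (the dimensions and counts below are placeholders).

```python
# Generic random-projection LSH sketch; RSS would replace the random directions
# with speaker-variability directions (illustration, not the paper's recipe).
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits, n_tables = 400, 16, 4            # e.g. i-vector dimension
projections = [rng.standard_normal((dim, n_bits)) for _ in range(n_tables)]

def hash_keys(vector):
    """One binary hash key (as an int) per table for a single embedding."""
    return [int("".join("1" if b else "0" for b in (vector @ P) > 0), 2)
            for P in projections]

tables = [dict() for _ in range(n_tables)]
database = rng.standard_normal((1000, dim))   # enrolled speaker embeddings
for idx, vec in enumerate(database):
    for table, key in zip(tables, hash_keys(vec)):
        table.setdefault(key, []).append(idx)

query = database[42] + 0.05 * rng.standard_normal(dim)   # noisy query
candidates = set().union(*[set(t.get(k, [])) for t, k in zip(tables, hash_keys(query))])
print(42 in candidates, len(candidates))      # likely True, far fewer than 1000
```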
2 code implementations • 23 Nov 2018 • Young-Gun Lee, Suwon Shon, Taesu Kim
First, we train the speech synthesis network bilingually in English and Korean and analyze how the network learns the relations of phoneme pronunciation between the languages.
no code implementations • 12 Sep 2018 • Suwon Shon, Wei-Ning Hsu, James Glass
In this paper, we explore the use of a factorized hierarchical variational autoencoder (FHVAE) model to learn an unsupervised latent representation for dialect identification (DID).
1 code implementation • 12 Sep 2018 • Suwon Shon, Hao Tang, James Glass
In this paper, we propose a Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings.
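A minimal sketch of this kind of model, with assumed layer sizes: frame-level 1-D convolutions, utterance-level statistics pooling, and a linear layer that produces the fixed-dimensional speaker embedding (the classifier head is used only during training).

```python
# Minimal sketch (layer sizes are assumptions): frame-level 1-D convolutions,
# utterance-level statistics pooling, and a linear speaker-embedding layer.
import torch
import torch.nn as nn

class CNNSpeakerEmbedder(nn.Module):
    def __init__(self, feat_dim=40, emb_dim=512, n_speakers=1000):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 512, emb_dim)       # mean + std pooling
        self.classifier = nn.Linear(emb_dim, n_speakers)   # training head only

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        h = self.frame_layers(feats.transpose(1, 2))   # (batch, 512, time)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.embedding(stats)           # embedding used at test time
        return self.classifier(emb), emb

model = CNNSpeakerEmbedder()
logits, emb = model(torch.randn(4, 300, 40)) # 3-second dummy utterances
print(emb.shape)                             # torch.Size([4, 512])
```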
no code implementations • COLING 2018 • Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Ahmed Ali, Suwon Shon, James Glass, Yves Scherrer, Tanja Samardžić, Nikola Ljubešić, Jörg Tiedemann, Chris van der Lee, Stefan Grondelaers, Nelleke Oostdijk, Dirk Speelman, Antal Van den Bosch, Ritesh Kumar, Bornini Lahiri, Mayank Jain
We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects.
1 code implementation • 17 Jul 2018 • Suwon Shon, Najim Dehak, Douglas Reynolds, James Glass
The Multitarget Challenge aims to assess how well current speech technology is able to determine whether or not a recorded utterance was spoken by one of a large number of 'blacklisted' speakers.
Audio and Speech Processing • Sound
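The core decision in such a task can be sketched as scoring the test-utterance embedding against every blacklisted speaker and thresholding the best match; the embeddings, sizes, and threshold below are dummy values.

```python
# Sketch of the core blacklist decision (all values are dummies): score a test
# embedding against every blacklisted speaker and flag it if the best cosine
# score exceeds a threshold tuned on development data.
import numpy as np

rng = np.random.default_rng(0)
blacklist = rng.standard_normal((1000, 600))       # one embedding per blacklisted speaker
test = rng.standard_normal(600)                    # embedding of the test utterance

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine(test, spk) for spk in blacklist])
best = int(scores.argmax())
is_blacklisted = scores[best] > 0.5                # threshold is an assumption
print(is_blacklisted, best, scores[best])
```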
2 code implementations • 12 Mar 2018 • Suwon Shon, Ahmed Ali, James Glass
Although the Siamese network with language embeddings did not achieve as good a result as the end-to-end DID system, the two approaches had good synergy when combined together in a fused system.
Sound • Audio and Speech Processing
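A toy sketch of what score-level fusion of the two systems could look like; the fusion weight and the five-dialect scores are made-up values.

```python
# Toy score-level fusion sketch (weights and scores are assumptions): combine
# per-dialect scores from the end-to-end system and the Siamese-embedding system.
import numpy as np

end_to_end = np.array([0.10, 0.60, 0.05, 0.15, 0.10])   # 5-dialect posteriors
siamese    = np.array([0.20, 0.40, 0.10, 0.20, 0.10])

alpha = 0.7                                    # fusion weight tuned on dev data
fused = alpha * end_to_end + (1 - alpha) * siamese
print(fused.argmax(), fused)                   # predicted dialect index and scores
```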
no code implementations • 28 Aug 2017 • Suwon Shon, Ahmed Ali, James Glass
In order to achieve a robust ADI system, we explored both Siamese neural network models to learn similarity and dissimilarities among Arabic dialects, as well as i-vector post-processing to adapt domain mismatches.
no code implementations • 3 Feb 2017 • Suwon Shon, Hanseok Ko
Using a development dataset spoken in Cebuano and Mandarin, we prepared the evaluation trials through preliminary experiments to compensate for the language-mismatched condition.
no code implementations • 21 Sep 2016 • Suwon Shon, Seongkyu Mun, John H. L. Hansen, Hanseok Ko
The experimental results show that the use of duration and score fusion improves language recognition performance by 5% relative in terms of the LRiMLC15 cost.