no code implementations • 20 Dec 2022 • Yi Zhou, Zhizheng Wu, Mingyang Zhang, Xiaohai Tian, Haizhou Li
Specifically, a text-to-speech (TTS) system is first pretrained with target-accented speech data.
no code implementations • 18 Dec 2022 • Chen Zhang, Luis Fernando D'Haro, Qiquan Zhang, Thomas Friedrichs, Haizhou Li
To tackle the multi-domain dialogue evaluation task, we propose a Panel of Experts (PoE), a multitask network that consists of a shared transformer encoder and a collection of lightweight adapters.
1 code implementation • 17 Dec 2022 • Bin Wang, Haizhou Li
We present Relational Sentence Embedding (RSE), a new paradigm to further discover the potential of sentence embeddings.
2 code implementations • 20 Nov 2022 • Jiawei Du, Yidi Jiang, Vincent Y. F. Tan, Joey Tianyi Zhou, Haizhou Li
To alleviate the adverse impact of this accumulated trajectory error, we propose a novel approach that encourages the optimization algorithm to seek a flat trajectory.
no code implementations • 18 Nov 2022 • Xiaoxue Gao, Xianghu Yue, Haizhou Li
The current lyrics transcription approaches heavily rely on supervised learning with labeled data, but such data are scarce and manual labeling of singing is expensive.
no code implementations • 2 Nov 2022 • Kong Aik Lee, Tomi Kinnunen, Daniele Colibro, Claudio Vair, Andreas Nautsch, Hanwu Sun, Liang He, Tianyu Liang, Qiongqiong Wang, Mickael Rouvier, Pierre-Michel Bousquet, Rohan Kumar Das, Ignacio Viñals Bailo, Meng Liu, Héctor Delgado, Xuechen Liu, Md Sahidullah, Sandro Cumani, Boning Zhang, Koji Okabe, Hitoshi Yamamoto, Ruijie Tao, Haizhou Li, Alfonso Ortega Giménez, Longbiao Wang, Luis Buera
This manuscript describes the I4U submission to the 2020 NIST Speaker Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS) Challenge.
1 code implementation • 31 Oct 2022 • Zexu Pan, Wupeng Wang, Marvin Borsdorf, Haizhou Li
In this paper, we study the audio-visual speaker extraction algorithms with intermittent visual cue.
1 code implementation • 30 Oct 2022 • Yiming Chen, Yan Zhang, Bin Wang, Zuozhu Liu, Haizhou Li
Most sentence embedding techniques heavily rely on expensive human-annotated sentence pairs as the supervised signals.
no code implementations • 30 Oct 2022 • Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li
Firstly, since speech is continuous while text is discrete, we discretize speech into a sequence of discrete speech tokens to solve the modality mismatch problem.
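A common way to obtain such discrete speech tokens is to cluster frame-level acoustic features and replace each frame with its nearest cluster index; the k-means tokenizer below is a generic sketch, not necessarily the quantizer used in the paper.

```python
import numpy as np

def kmeans_tokenize(features, n_tokens=8, n_iters=20, seed=0):
    """Map continuous feature frames (T, D) to discrete token ids via k-means."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen frames.
    centroids = features[rng.choice(len(features), n_tokens, replace=False)]
    for _ in range(n_iters):
        # Assign each frame to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned frames.
        for k in range(n_tokens):
            if np.any(assign == k):
                centroids[k] = features[assign == k].mean(axis=0)
    return assign, centroids

# Toy "speech features": 100 frames of 13-dim MFCC-like vectors.
feats = np.random.default_rng(1).normal(size=(100, 13))
tokens, _ = kmeans_tokenize(feats)
print(tokens.shape)  # one discrete token id per frame
```

The token sequence can then be handled by the same discrete-sequence machinery as text.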
1 code implementation • 28 Oct 2022 • Ruijie Tao, Kong Aik Lee, Zhan Shi, Haizhou Li
However, noisy samples (i.e., samples with wrong labels) in the training set induce confusion and cause the network to learn incorrect representations.
no code implementations • 27 Oct 2022 • Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li
We study a novel neural architecture and its training strategies of speaker encoder for speaker recognition without using any identity labels.
1 code implementation • 27 Oct 2022 • Yifan Hu, Rui Liu, Guanglai Gao, Haizhou Li
Therefore, we propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation.
1 code implementation • 27 Oct 2022 • Haolin Zuo, Rui Liu, Jinming Zhao, Guanglai Gao, Haizhou Li
Multimodal emotion recognition leverages complementary information across modalities to gain performance.
no code implementations • 27 Oct 2022 • Rui Liu, Haolin Zuo, De Hu, Guanglai Gao, Haizhou Li
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1).
no code implementations • 25 Oct 2022 • Kun Zhou, Berrak Sisman, Carlos Busso, Haizhou Li
Each attribute measures the degree of relevance between speech recordings belonging to different emotion types.
2 code implementations • 25 Oct 2022 • Chen Zhang, Luis Fernando D'Haro, Qiquan Zhang, Thomas Friedrichs, Haizhou Li
Recent model-based reference-free metrics for open-domain dialogue evaluation exhibit promising correlations with human judgment.
1 code implementation • 21 Oct 2022 • Bin Wang, Chen Zhang, Yan Zhang, Yiming Chen, Haizhou Li
The factual correctness of summaries has the highest priority before practical applications.
1 code implementation • 10 Oct 2022 • Qu Yang, Jibin Wu, Malu Zhang, Yansong Chua, Xinchao Wang, Haizhou Li
The LTL rule follows the teacher-student learning approach by mimicking the intermediate feature representations of a pre-trained ANN.
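The teacher-student idea can be sketched as minimising the distance between a student network's intermediate features and those of the pre-trained teacher at corresponding layers; the function and toy data below are illustrative, not taken from the paper.

```python
import numpy as np

def feature_mimic_loss(student_feats, teacher_feats):
    """Mean-squared error between student and teacher intermediate features,
    averaged over layers -- a generic teacher-student (distillation) objective."""
    return float(np.mean([np.mean((s - t) ** 2)
                          for s, t in zip(student_feats, teacher_feats)]))

rng = np.random.default_rng(0)
teacher = [rng.normal(size=(4, 8)) for _ in range(3)]        # 3 hidden layers
student_perfect = [t.copy() for t in teacher]                # perfect mimicry
student_random = [rng.normal(size=(4, 8)) for _ in range(3)]

print(feature_mimic_loss(student_perfect, teacher))  # 0.0 for identical features
loss = feature_mimic_loss(student_random, teacher)   # > 0 otherwise
```

Minimising this loss layer by layer pushes the student (here, the SNN) toward the teacher ANN's internal representations.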
1 code implementation • 24 Sep 2022 • Bin Wang, Chen Zhang, Chengwei Wei, Haizhou Li
Output length is critical to dialogue summarization systems.
no code implementations • 23 Sep 2022 • Qutang Cai, Guoqiang Hong, Zhijian Ye, Ximin Li, Haizhou Li
This technical report describes our system for tracks 1, 2 and 4 of the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22).
no code implementations • 22 Sep 2022 • Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li
Accented TTS synthesis is challenging, as L2 differs from L1 in terms of both phonetic rendering and prosody pattern.
1 code implementation • 12 Sep 2022 • Xiaoyi Qin, Ming Li, Hui Bu, Shrikanth Narayanan, Haizhou Li
In addition, a supplementary set for the FFSVC2020 dataset is released this year.
no code implementations • 11 Aug 2022 • Kun Zhou, Berrak Sisman, Rajib Rana, B. W. Schuller, Haizhou Li
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
no code implementations • 15 Jul 2022 • Xiaoxue Gao, Chitralekha Gupta, Haizhou Li
Lyrics transcription of polyphonic music is challenging as the background music affects lyrics intelligibility.
1 code implementation • 15 Jun 2022 • Rui Liu, Berrak Sisman, Björn Schuller, Guanglai Gao, Haizhou Li
In this paper, we propose a data-driven deep learning model, i.e., StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
1 code implementation • ACL 2022 • Jinming Zhao, Tenggan Zhang, Jingwen Hu, Yuchen Liu, Qin Jin, Xinchao Wang, Haizhou Li
In this work, we propose a Multi-modal Multi-scene Multi-label Emotional Dialogue dataset, M3ED, which contains 990 dyadic emotional dialogues from 56 different TV series, with a total of 9,082 turns and 24,449 utterances.
1 code implementation • 7 Apr 2022 • Xiaoxue Gao, Chitralekha Gupta, Haizhou Li
To improve the robustness of lyrics transcription to background music, we propose combining two kinds of features: music-removed features, extracted from the separated singing vocals and thus emphasizing the vocals, and music-present features, which capture the singing vocals together with the background music.
no code implementations • 7 Apr 2022 • Xiaoxue Gao, Chitralekha Gupta, Haizhou Li
Lyrics transcription of polyphonic music is challenging not only because the singing vocals are corrupted by the background music, but also because the background music and the singing style vary across music genres, such as pop, metal, and hip hop, which affects lyrics intelligibility of the song in different ways.
1 code implementation • 31 Mar 2022 • Zexu Pan, Meng Ge, Haizhou Li
We propose a hybrid continuity loss function for time-domain speaker extraction algorithms to settle the over-suppression problem.
Automatic Speech Recognition (ASR)
1 code implementation • 31 Mar 2022 • Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, LiRong Dai, Jinyu Li, Yao Qian, Furu Wei
In this way, the decoder learns to reconstruct original speech information with codes before learning to generate correct text.
Automatic Speech Recognition (ASR)
1 code implementation • 31 Mar 2022 • Zexu Pan, Xinyuan Qian, Haizhou Li
Speaker extraction seeks to extract the clean speech of a target speaker from a multi-talker mixture speech.
1 code implementation • 29 Mar 2022 • Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li
LightHuBERT outperforms the original HuBERT on ASR and five SUPERB tasks at the HuBERT size, achieves performance comparable to the teacher model on most tasks with a 29% reduction in parameters, and obtains a $3.5\times$ compression ratio on three SUPERB tasks, e.g., automatic speaker verification, keyword spotting, and intent classification, with a slight accuracy loss.
Automatic Speech Recognition (ASR)
1 code implementation • ACL 2022 • Bin Wang, C. -C. Jay Kuo, Haizhou Li
Word and sentence similarity tasks have become the de facto evaluation method.
1 code implementation • 21 Feb 2022 • Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li
Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance.
no code implementations • 17 Feb 2022 • Jiangyan Yi, Ruibo Fu, JianHua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu
Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021.
no code implementations • 3 Feb 2022 • Tianchi Liu, Rohan Kumar Das, Kong Aik Lee, Haizhou Li
The time delay neural network (TDNN) represents one of the state-of-the-art of neural solutions to text-independent speaker verification.
no code implementations • 10 Jan 2022 • Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller, Haizhou Li
As desired, the proposed network controls the fine-grained emotion intensity in the output speech.
1 code implementation • 14 Dec 2021 • Chen Zhang, Luis Fernando D'Haro, Thomas Friedrichs, Haizhou Li
Chatbots are designed to carry out human-like conversations across different domains, such as general chit-chat, knowledge exchange, and persona-grounded conversations.
Ranked #1 on Dialogue Evaluation on USR-TopicalChat
2 code implementations • 12 Nov 2021 • Rohan Kumar Das, Ruijie Tao, Haizhou Li
This work provides a brief description of Human Language Technology (HLT) Laboratory, National University of Singapore (NUS) system submission for 2020 NIST conversational telephone speech (CTS) speaker recognition evaluation (SRE).
no code implementations • 27 Oct 2021 • Jinming Zhao, Ruichen Li, Qin Jin, Xinchao Wang, Haizhou Li
Multimodal emotion recognition study is hindered by the lack of labelled corpora in terms of scale and diversity, due to the high annotation cost and label ambiguity.
no code implementations • 20 Oct 2021 • Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li
Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotional style.
3 code implementations • CVPR 2022 • Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.
no code implementations • 13 Oct 2021 • Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li
At the same time, as the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
1 code implementation • 8 Oct 2021 • Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li
In self-supervised learning for speaker recognition, pseudo labels are useful as the supervision signals.
no code implementations • 7 Oct 2021 • Junchen Lu, Berrak Sisman, Rui Liu, Mingyang Zhang, Haizhou Li
The proposed VisualTTS adopts two novel mechanisms, 1) textual-visual attention and 2) a visual fusion strategy during acoustic decoding, which both contribute to forming accurate alignment between the input text content and the lip motion in the input lip sequence.
1 code implementation • 7 Oct 2021 • Rui Liu, Berrak Sisman, Haizhou Li
The emotion strength of synthesized speech can be controlled flexibly using a strength descriptor, which is obtained by an emotion attribute ranking function.
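An attribute ranking function of this kind can be approximated by a linear scorer trained on ordered pairs, in the style of relative attributes; the perceptron-style ranker and toy data below are a hypothetical sketch, not the paper's model.

```python
import numpy as np

def train_ranker(pairs, dim, lr=0.1, epochs=50):
    """Learn w so that w.x_strong > w.x_weak for every ordered pair
    (perceptron-style ranking updates)."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for strong, weak in pairs:
            if w @ strong <= w @ weak:       # ranking constraint violated
                w += lr * (strong - weak)    # push the two scores apart
    return w

# Toy features: the first dimension correlates with emotion strength.
strong = np.array([[3.0, 0.1], [4.0, -0.2], [5.0, 0.3]])
weak   = np.array([[1.0, 0.2], [0.5, -0.1], [0.0, 0.4]])
w = train_ranker(list(zip(strong, weak)), dim=2)

# The learned scalar w @ x acts as a strength descriptor.
print(w @ strong.mean(axis=0) > w @ weak.mean(axis=0))  # True
```

At synthesis time such a scalar can be swept continuously to control the emotion strength of the output.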
no code implementations • 5 Oct 2021 • Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Thomas Friedrichs, Haizhou Li
Yet, the impact of different Pr-LMs on the performance of automatic metrics is not well-understood.
1 code implementation • EMNLP 2021 • Yiming Chen, Yan Zhang, Chen Zhang, Grandee Lee, Ran Cheng, Haizhou Li
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
1 code implementation • 3 Oct 2021 • Yi Ma, Kong Aik Lee, Ville Hautamaki, Haizhou Li
Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise.
1 code implementation • 30 Sep 2021 • Zexu Pan, Meng Ge, Haizhou Li
The speaker extraction algorithm requires an auxiliary reference, such as a video recording or a pre-recorded speech, to form top-down auditory attention on the target speaker.
no code implementations • 28 Sep 2021 • Bidisha Sharma, Maulik Madhavi, Xuehao Zhou, Haizhou Li
In particular, we use synthesized speech generated from an English-Mandarin text corpus for analysis and training of a multi-lingual intent classification model.
1 code implementation • 5 Aug 2021 • Yidi Jiang, Bidisha Sharma, Maulik Madhavi, Haizhou Li
In this regard, we leverage the reliable and widely used bidirectional encoder representations from transformers (BERT) model as a language model and transfer the knowledge to build an acoustic model for intent classification using the speech.
Automatic Speech Recognition (ASR)
1 code implementation • ACL 2021 • Yan Zhang, Ruidan He, Zuozhu Liu, Lidong Bing, Haizhou Li
As high-quality labeled data is scarce, unsupervised sentence representation learning has attracted much attention.
3 code implementations • 14 Jul 2021 • Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers.
no code implementations • 14 Jul 2021 • Hongning Zhu, Kong Aik Lee, Haizhou Li
Instead of utilizing multi-head attention in parallel, the proposed serialized multi-layer multi-head attention is designed to aggregate and propagate attentive statistics from one layer to the next in a serialized manner.
no code implementations • 8 Jul 2021 • Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li
Traditional voice conversion (VC) has focused on speaker identity conversion for speech with a neutral expression.
1 code implementation • 14 Jun 2021 • Zexu Pan, Ruijie Tao, Chenglin Xu, Haizhou Li
A speaker extraction algorithm seeks to extract the speech of a target speaker from a multi-talker speech mixture when given a cue that represents the target speaker, such as a pre-enrolled speech utterance, or an accompanying video track.
1 code implementation • ACL 2021 • Chen Zhang, Yiming Chen, Luis Fernando D'Haro, Yan Zhang, Thomas Friedrichs, Grandee Lee, Haizhou Li
Effective evaluation metrics should reflect the dynamics of such interaction.
1 code implementation • The ActivityNet Large-Scale Activity Recognition Challenge Workshop, CVPR 2021 • Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers.
1 code implementation • 31 May 2021 • Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li
In this paper, we first provide a review of the state-of-the-art emotional voice conversion research, and the existing emotional speech databases.
no code implementations • 5 Apr 2021 • Qicong Xie, Xiaohai Tian, Guanghou Liu, Kun Song, Lei Xie, Zhiyong Wu, Hai Li, Song Shi, Haizhou Li, Fen Hong, Hui Bu, Xin Xu
The challenge consists of two tracks, namely few-shot track and one-shot track, where the participants are required to clone multiple target voices with 100 and 5 samples respectively.
no code implementations • 3 Apr 2021 • Rui Liu, Berrak Sisman, Haizhou Li
To the best of our knowledge, this is the first study of reinforcement learning in emotional text-to-speech synthesis.
2 code implementations • 31 Mar 2021 • Kun Zhou, Berrak Sisman, Haizhou Li
In stage 2, we perform emotion training with a limited amount of emotional speech data, to learn how to disentangle emotional style and linguistic information from the speech.
1 code implementation • 30 Mar 2021 • Chenglin Xu, Wei Rao, Jibin Wu, Haizhou Li
Inspired by studies on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that is able to pay selective auditory attention to the target speaker.
no code implementations • 15 Feb 2021 • Bidisha Sharma, Maulik Madhavi, Haizhou Li
An intent classification system is usually implemented as a pipeline process, with a speech recognition module followed by text processing that classifies the intents.
no code implementations • 19 Nov 2020 • Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li
Speaker extraction requires a sample speech from the target speaker as the reference.
no code implementations • 3 Nov 2020 • Kun Zhou, Berrak Sisman, Haizhou Li
Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity.
2 code implementations • 28 Oct 2020 • Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
no code implementations • 23 Oct 2020 • Rui Liu, Berrak Sisman, Haizhou Li
Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways.
no code implementations • 15 Oct 2020 • Zexu Pan, Ruijie Tao, Chenglin Xu, Haizhou Li
A speaker extraction algorithm relies on a speech sample from the target speaker as the reference point to focus its attention.
no code implementations • 20 Aug 2020 • Tianchi Liu, Rohan Kumar Das, Maulik Madhavi, ShengMei Shen, Haizhou Li
The proposed SUDA features an attention mask mechanism to learn the interaction between the speaker and utterance information streams.
no code implementations • 11 Aug 2020 • Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li
We propose a multi-task learning scheme for Tacotron training, that optimizes the system to predict both Mel spectrum and phrase breaks.
no code implementations • 11 Aug 2020 • Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li
It relies on non-parallel training data from two different languages, hence, is more challenging than mono-lingual voice conversion.
no code implementations • 10 Aug 2020 • Junchen Lu, Kun Zhou, Berrak Sisman, Haizhou Li
We train an encoder to disentangle singer identity and singing prosody (F0 contour) from phonetic content.
no code implementations • 7 Jul 2020 • Zihan Pan, Malu Zhang, Jibin Wu, Haizhou Li
Inspired by the mammal's auditory localization pathway, in this paper we propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment, and implement this algorithm in a real-time robotic system with a microphone array.
no code implementations • 2 Jul 2020 • Jibin Wu, Cheng-Lin Xu, Daquan Zhou, Haizhou Li, Kay Chen Tan
In this paper, we propose a novel ANN-to-SNN conversion and layer-wise learning framework for rapid and efficient pattern recognition, which is referred to as progressive tandem learning of deep SNNs.
no code implementations • ACL 2020 • Grandee Lee, Haizhou Li
A bilingual language model is expected to model the sequential dependency for words across languages, which is difficult due to the inherent lack of suitable training data as well as diverse syntactic structure across languages.
no code implementations • 3 Jun 2020 • Srivatsa P, Kyle Timothy Ng Chu, Burin Amornpaisannon, Yaswanth Tavva, Venkata Pavan Kumar Miriyala, Jibin Wu, Malu Zhang, Haizhou Li, Trevor E. Carlson
Rate encoding could be seen as an inefficient scheme for SNNs because it involves the transmission of a large number of spikes.
1 code implementation • 13 May 2020 • Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li
We consider that there is a common code between speakers for emotional expression in a spoken language, therefore, a speaker-independent mapping between emotional states is possible.
no code implementations • 10 May 2020 • Meng Ge, Cheng-Lin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li
To eliminate such mismatch, we propose a complete time-domain speaker extraction solution called SpEx+.
Audio and Speech Processing • Sound
no code implementations • 29 Apr 2020 • Cheng-Lin Xu, Wei Rao, Eng Siong Chng, Haizhou Li
The inaccuracy of phase estimation is inherent to the frequency domain processing, that affects the quality of signal reconstruction.
Audio and Speech Processing • Sound
1 code implementation • 17 Apr 2020 • Cheng-Lin Xu, Wei Rao, Eng Siong Chng, Haizhou Li
Inspired by Conv-TasNet, we propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra.
no code implementations • 26 Mar 2020 • Malu Zhang, Jiadong Wang, Burin Amornpaisannon, Zhixuan Zhang, VPK Miriyala, Ammar Belatreche, Hong Qu, Jibin Wu, Yansong Chua, Trevor E. Carlson, Haizhou Li
In STDBP algorithm, the timing of individual spikes is used to convey information (temporal coding), and learning (back-propagation) is performed based on spike timing in an event-driven manner.
no code implementations • 2 Feb 2020 • Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li
To address this problem, we propose a new training scheme for Tacotron-based TTS, referred to as WaveTTS, that has 2 loss functions: 1) time-domain loss, denoted as the waveform loss, that measures the distortion between the natural and generated waveform; and 2) frequency-domain loss, that measures the Mel-scale acoustic feature loss between the natural and generated acoustic features.
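The two objectives can be sketched as a time-domain waveform loss plus a frequency-domain magnitude loss; for brevity this sketch uses a plain FFT magnitude in place of the Mel-scale features described in the paper.

```python
import numpy as np

def dual_domain_loss(generated, natural, alpha=1.0):
    """Waveform (time-domain) MSE plus magnitude-spectrum (frequency-domain) MSE."""
    time_loss = np.mean((generated - natural) ** 2)
    gen_mag = np.abs(np.fft.rfft(generated))
    nat_mag = np.abs(np.fft.rfft(natural))
    freq_loss = np.mean((gen_mag - nat_mag) ** 2)
    return time_loss + alpha * freq_loss

t = np.linspace(0, 1, 1600, endpoint=False)
natural = np.sin(2 * np.pi * 220 * t)                     # reference waveform
generated = natural + 0.01 * np.cos(2 * np.pi * 50 * t)   # slightly distorted copy

print(dual_domain_loss(natural, natural))   # 0.0 for identical signals
loss = dual_domain_loss(generated, natural) # > 0 once distortion is present
```

Weighting the two terms (via `alpha`, an illustrative knob here) trades off waveform fidelity against spectral fidelity.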
1 code implementation • 1 Feb 2020 • Kun Zhou, Berrak Sisman, Haizhou Li
Many studies require parallel speech data between different emotional patterns, which is not practical in real life.
no code implementations • 25 Nov 2019 • Van Tung Pham, Hai-Hua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li
To address this problem, in this work, we propose a new architecture that separates the decoder subnet from the encoder output.
Automatic Speech Recognition (ASR)
1 code implementation • 19 Nov 2019 • Jibin Wu, Emre Yilmaz, Malu Zhang, Haizhou Li, Kay Chen Tan
The brain-inspired spiking neural networks (SNN) closely mimic the biological neural networks and can operate on low-power neuromorphic hardware with spike-based computation.
Automatic Speech Recognition (ASR)
no code implementations • 7 Nov 2019 • Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao, Haizhou Li
We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder, that serves as a teacher model.
no code implementations • 27 Sep 2019 • Xianghu Yue, Grandee Lee, Emre Yilmaz, Fang Deng, Haizhou Li
In this work, we describe an E2E ASR pipeline for the recognition of CS speech in which a low-resourced language is mixed with a high resourced language.
Automatic Speech Recognition (ASR)
no code implementations • 23 Sep 2019 • Chitralekha Gupta, Emre Yilmaz, Haizhou Li
Automatic lyrics alignment and transcription in polyphonic music are challenging tasks because the singing vocals are corrupted by the background music.
Audio and Speech Processing • Sound
no code implementations • 12 Sep 2019 • Zihan Pan, Jibin Wu, Yansong Chua, Malu Zhang, Haizhou Li
We show that, with population neural codings, the encoded patterns are linearly separable using the Support Vector Machine (SVM).
no code implementations • 3 Sep 2019 • Zihan Pan, Yansong Chua, Jibin Wu, Malu Zhang, Haizhou Li, Eliathamby Ambikairajah
The neural encoding scheme, that we call Biologically plausible Auditory Encoding (BAE), emulates the functions of the perceptual components of the human auditory system, that include the cochlear filter bank, the inner hair cells, auditory masking effects from psychoacoustic models, and the spike neural encoding by the auditory nerve.
1 code implementation • 2 Jul 2019 • Jibin Wu, Yansong Chua, Malu Zhang, Guoqi Li, Haizhou Li, Kay Chen Tan
Spiking neural networks (SNNs) represent the most prominent biologically inspired computing model for neuromorphic computing (NC) architectures.
no code implementations • 25 Jun 2019 • Chitralekha Gupta, Emre Yilmaz, Haizhou Li
In this work, we propose (1) using additional speech and music-informed features and (2) adapting the acoustic models trained on a large amount of solo singing vocals towards polyphonic music using a small amount of in-domain data.
no code implementations • 19 Jun 2019 • Emre Yilmaz, Adem Derinel, Zhou Kun, Henk van den Heuvel, Niko Brummer, Haizhou Li, David A. van Leeuwen
This paper describes our initial efforts to build a large-scale speaker diarization (SD) and identification system on a recently digitized radio broadcast archive from the Netherlands which has more than 6500 audio tapes with 3000 hours of Frisian-Dutch speech recorded between 1950-2016.
no code implementations • 19 Jun 2019 • Qinyi Wang, Emre Yilmaz, Adem Derinel, Haizhou Li
Code-switching (CS) detection refers to the automatic detection of language switches in code-mixed utterances.
Automatic Speech Recognition (ASR)
no code implementations • 18 Jun 2019 • Emre Yilmaz, Samuel Cohen, Xianghu Yue, David van Leeuwen, Haizhou Li
This archive contains recordings with monolingual Frisian and Dutch speech segments as well as Frisian-Dutch CS speech, hence the recognition performance on monolingual segments is also vital for accurate transcriptions.
Automatic Speech Recognition (ASR)
no code implementations • 27 May 2019 • Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, Satoshi Nakamura
Our proposed approach significantly improved the intelligibility (in CER), the MOS, and discrimination ABX scores compared to the official ZeroSpeech 2019 baseline or even the topline.
no code implementations • 16 Apr 2019 • Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Hitoshi Yamamoto, Koji Okabe, Ville Vestman, Jing Huang, Guohong Ding, Hanwu Sun, Anthony Larcher, Rohan Kumar Das, Haizhou Li, Mickael Rouvier, Pierre-Michel Bousquet, Wei Rao, Qing Wang, Chunlei Zhang, Fahimeh Bahmaninezhad, Hector Delgado, Jose Patino, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Koichi Shinoda, Trung Ngo Trong, Md Sahidullah, Fan Lu, Yun Tang, Ming Tu, Kah Kuan Teh, Huy Dat Tran, Kuruvachan K. George, Ivan Kukanov, Florent Desnous, Jichen Yang, Emre Yilmaz, Longting Xu, Jean-Francois Bonastre, Cheng-Lin Xu, Zhi Hao Lim, Eng Siong Chng, Shivesh Ranjan, John H. L. Hansen, Massimiliano Todisco, Nicholas Evans
The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE).
no code implementations • 29 Mar 2019 • Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, Junichi Yamagishi
We propose using an extended model architecture of Tacotron, that is a multi-source sequence-to-sequence model with a dual attention mechanism as the shared model for both the TTS and VC tasks.
1 code implementation • 24 Mar 2019 • Cheng-Lin Xu, Wei Rao, Eng Siong Chng, Haizhou Li
The SpeakerBeam-FE (SBF) method is proposed for speaker extraction.
no code implementations • 15 Feb 2019 • Jibin Wu, Yansong Chua, Malu Zhang, Qu Yang, Guoqi Li, Haizhou Li
Deep spiking neural networks (SNNs) support asynchronous event-driven computation, massive parallelism and demonstrate great potential to improve the energy efficiency of its synchronous analog counterpart.
1 code implementation • 1 Nov 2018 • Zhiping Zeng, Yerbolat Khassanov, Van Tung Pham, Hai-Hua Xu, Eng Siong Chng, Haizhou Li
Code-switching (CS) refers to a linguistic phenomenon where a speaker uses different languages in an utterance or between alternating utterances.
no code implementations • 17 Sep 2018 • Longting Xu, Rohan Kumar Das, Emre Yilmaz, Jichen Yang, Haizhou Li
Speaker verification (SV) systems using deep neural network embeddings, the so-called x-vector systems, are becoming popular due to their performance, which is superior to that of i-vector systems.
no code implementations • 3 Jul 2018 • Laxmi R. Iyer, Yansong Chua, Haizhou Li
We also use this SNN for further experiments on N-MNIST to show that rate-based SNNs perform better, and that precise spike timings are not important in N-MNIST.
no code implementations • WS 2018 • Nancy Chen, Xiangyu Duan, Min Zhang, Rafael E. Banchs, Haizhou Li
Transliteration is defined as phonetic translation of names across languages.
no code implementations • WS 2018 • Zhongwei Li, Xuancong Wang, Ai Ti Aw, Eng Siong Chng, Haizhou Li
Customized translation needs to pay special attention to target-domain terminology, especially the named entities of the domain.
no code implementations • WS 2018 • Nancy Chen, Rafael E. Banchs, Min Zhang, Xiangyu Duan, Haizhou Li
This report presents the results from the Named Entity Transliteration Shared Task conducted as part of The Seventh Named Entities Workshop (NEWS 2018) held at ACL 2018 in Melbourne, Australia.
no code implementations • 10 Jun 2018 • Yougen Yuan, Cheung-Chi Leung, Lei Xie, Hongjie Chen, Bin Ma, Haizhou Li
We also find that it is important to have sufficient speech segment pairs to train the deep CNN for effective acoustic word embeddings.
no code implementations • 30 Apr 2018 • Chong Zhang, Geok Soon Hong, Jun-Hong Zhou, Kay Chen Tan, Haizhou Li, Huan Xu, Jihoon Hong, Hian-Leng Chan
For fault diagnosis, a cost-sensitive deep belief network (namely ECS-DBN) is applied to deal with the imbalanced data problem for tool state estimation.
no code implementations • 28 Apr 2018 • Chong Zhang, Kay Chen Tan, Haizhou Li, Geok Soon Hong
Adaptive differential evolution optimization is implemented as the optimization algorithm that automatically updates its corresponding parameters without the need of prior domain knowledge.
4 code implementations • 6 Jul 2017 • Shan Yang, Lei Xie, Xiao Chen, Xiaoyan Lou, Xuan Zhu, Dong-Yan Huang, Haizhou Li
In this paper, we aim at improving the performance of synthesized speech in statistical parametric speech synthesis (SPSS) based on a generative adversarial network (GAN).
Sound
no code implementations • 9 Feb 2016 • Xiaohai Tian, Zhizheng Wu, Xiong Xiao, Eng Siong Chng, Haizhou Li
To simulate the real-life scenarios, we perform a preliminary investigation of spoofing detection under additive noisy conditions, and also describe an initial database for this task.
no code implementations • 5 Feb 2016 • Kong Aik Lee, Ville Hautamäki, Anthony Larcher, Wei Rao, Hanwu Sun, Trung Hieu Nguyen, Guangsen Wang, Aleksandr Sizov, Ivan Kukanov, Amir Poorjam, Trung Ngo Trong, Xiong Xiao, Cheng-Lin Xu, Hai-Hua Xu, Bin Ma, Haizhou Li, Sylvain Meignier
This article describes the systems jointly submitted by the Institute for Infocomm Research (I$^2$R), the Laboratoire d'Informatique de l'Université du Maine (LIUM), Nanyang Technological University (NTU) and the University of Eastern Finland (UEF) for the 2015 NIST Language Recognition Evaluation (LRE).
no code implementations • MediaEval 2015 Workshop 2015 • Jingyong Hou, Van Tung Pham, Cheung-Chi Leung, Lei Wang, HaiHua Xu, Hang Lv, Lei Xie, Zhonghua Fu, Chongjia Ni, Xiong Xiao, Hongjie Chen, Shaofei Zhang, Sining Sun, Yougen Yuan, Pengcheng Li, Tin Lay Nwe, Sunil Sivadas, Bin Ma, Eng Siong Chng, Haizhou Li
This paper describes the system developed by the NNI team for the Query-by-Example Search on Speech Task (QUESST) in the MediaEval 2015 evaluation.
Ranked #9 on Keyword Spotting on QUESST
no code implementations • 16 Oct 2014 • Peng Yang, HaiHua Xu, Xiong Xiao, Lei Xie, Cheung-Chi Leung, Hongjie Chen, JIA YU, Hang Lv, Lei Wang, Su Jun Leow, Bin Ma, Eng Siong Chng, Haizhou Li
For both symbolic and DTW search, partial sequence matching is performed to reduce the miss rate, especially for query types 2 and 3.
Ranked #6 on Keyword Spotting on QUESST
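Partial sequence matching of this kind is commonly implemented as subsequence DTW, where the query may align to any contiguous region of the reference; the sketch below is a generic illustration, not the exact system described above.

```python
import numpy as np

def subsequence_dtw_cost(query, reference):
    """Minimal-cost alignment of `query` against any contiguous region of
    `reference` (subsequence DTW): the first row is initialised to zero so the
    match may start anywhere, and the best end point is taken from the last row."""
    Q, R = len(query), len(reference)
    D = np.full((Q + 1, R + 1), np.inf)
    D[0, :] = 0.0                      # query may start at any reference frame
    for i in range(1, Q + 1):
        for j in range(1, R + 1):
            cost = abs(query[i - 1] - reference[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Q, 1:].min()              # best end point anywhere in reference

reference = [0, 0, 1, 2, 3, 0, 0]
print(subsequence_dtw_cost([1, 2, 3], reference))  # 0.0: exact partial match
print(subsequence_dtw_cost([5, 5], reference))     # > 0: no good match
```

In a real query-by-example system the scalars would be frame-level feature vectors and `abs` a frame-distance such as cosine or Euclidean distance.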