Search Results for author: Haizhou Li

Found 95 papers, 21 papers with code

HLT-NUS SUBMISSION FOR 2020 NIST Conversational Telephone Speech SRE

1 code implementation12 Nov 2021 Rohan Kumar Das, Ruijie Tao, Haizhou Li

This work provides a brief description of the Human Language Technology (HLT) Laboratory, National University of Singapore (NUS), system submission for the 2020 NIST conversational telephone speech (CTS) speaker recognition evaluation (SRE).

Domain Adaptation Speaker Recognition

MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition

no code implementations27 Oct 2021 Jinming Zhao, Ruichen Li, Qin Jin, Xinchao Wang, Haizhou Li

Multimodal emotion recognition study is hindered by the lack of labelled corpora in terms of scale and diversity, due to the high annotation cost and label ambiguity.

Emotion Classification Multimodal Emotion Recognition +1

Identity Conversion for Emotional Speakers: A Study for Disentanglement of Emotion Style and Speaker Identity

no code implementations20 Oct 2021 Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li

Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and speaker-dependent emotion style.

Hierarchical structure Voice Conversion

DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding

no code implementations13 Oct 2021 Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li

At the same time, as the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.

Speech Synthesis Voice Conversion

VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over

no code implementations7 Oct 2021 Junchen Lu, Berrak Sisman, Rui Liu, Mingyang Zhang, Haizhou Li

The proposed VisualTTS adopts two novel mechanisms, namely 1) textual-visual attention and 2) a visual fusion strategy during acoustic decoding, both of which contribute to forming an accurate alignment between the input text content and the lip motion in the input lip sequence.

Speech Synthesis

StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis

1 code implementation7 Oct 2021 Rui Liu, Berrak Sisman, Haizhou Li

The emotion strength of synthesized speech can be controlled flexibly using a strength descriptor, which is obtained by an emotion attribute ranking function.

Data Augmentation Emotional Speech Synthesis +1

Revisiting Self-Training for Few-Shot Learning of Language Model

1 code implementation EMNLP 2021 Yiming Chen, Yan Zhang, Chen Zhang, Grandee Lee, Ran Cheng, Haizhou Li

In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.

Few-Shot Learning Fine-tuning +3

PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation Extraction

1 code implementation3 Oct 2021 Yi Ma, Kong Aik Lee, Ville Hautamaki, Haizhou Li

Speech enhancement aims to improve the perceptual quality of the speech signal by suppressing background noise.

Speaker Identification Speaker Verification +1

USEV: Universal Speaker Extraction with Visual Cue

1 code implementation30 Sep 2021 Zexu Pan, Meng Ge, Haizhou Li

In this paper, we propose a universal speaker extraction network that works for all multi-talker scenarios, where the target speaker can be either absent or present.

Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification

no code implementations28 Sep 2021 Bidisha Sharma, Maulik Madhavi, Xuehao Zhou, Haizhou Li

In particular, we use synthesized speech generated from an English-Mandarin text corpus for analysis and training of a multi-lingual intent classification model.

Classification Intent Classification

Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification

1 code implementation5 Aug 2021 Yidi Jiang, Bidisha Sharma, Maulik Madhavi, Haizhou Li

In this regard, we leverage the reliable and widely used bidirectional encoder representations from transformers (BERT) model as a language model and transfer its knowledge to build an acoustic model for intent classification from speech.

Automatic Speech Recognition Classification +6
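The paper's exact distillation setup is not reproduced here, but the general pattern of transferring knowledge from a text-based teacher (e.g., BERT) to a speech-based student can be sketched as an embedding-matching loss plus a soft-label KL term. All shapes, the temperature, and the equal weighting below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical outputs: teacher (text) and student (speech) each map an
# utterance to a D-dim embedding and C class logits.
D, C = 16, 4
teacher_emb = rng.normal(size=(8, D))
student_emb = teacher_emb + 0.1 * rng.normal(size=(8, D))   # imperfect student
teacher_logits = rng.normal(size=(8, C))
student_logits = teacher_logits + 0.1 * rng.normal(size=(8, C))

# Distillation loss: embedding-matching MSE + KL between softened predictions.
mse = np.mean((student_emb - teacher_emb) ** 2)
p_t = softmax(teacher_logits / 2.0)   # temperature T = 2 is an assumption
p_s = softmax(student_logits / 2.0)
kl = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1))
loss = mse + kl                       # equal weighting is an assumption
```

In practice both models would be trained networks and this combined loss would be minimized by gradient descent; the sketch only shows how the two terms are assembled.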

Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding

no code implementations14 Jul 2021 Hongning Zhu, Kong Aik Lee, Haizhou Li

Instead of utilizing multi-head attention in parallel, the proposed serialized multi-layer multi-head attention is designed to aggregate and propagate attentive statistics from one layer to the next in a serialized manner.

Text-Independent Speaker Verification
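The serialized aggregation idea can be illustrated with attentive statistics pooling: each layer computes an attention-weighted mean and standard deviation over frames and propagates the statistics to the next layer. This numpy sketch is a hypothetical illustration in which random projections stand in for learned parameters; it is not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def attentive_stats(x, w):
    """Attention-weighted mean and std over the time axis.
    x: (T, D) frame features; w: (T,) attention weights summing to 1."""
    mean = (w[:, None] * x).sum(axis=0)
    var = (w[:, None] * (x - mean) ** 2).sum(axis=0)
    return mean, np.sqrt(var + 1e-8)

T, D, L = 50, 8, 3                  # frames, feature dim, layers (assumed)
x = rng.normal(size=(T, D))
utterance_stats = []
for layer in range(L):
    scores = x @ rng.normal(size=D)  # stand-in for a learned attention scorer
    w = np.exp(scores) / np.exp(scores).sum()
    mean, std = attentive_stats(x, w)
    utterance_stats.append(np.concatenate([mean, std]))
    x = x + mean                     # propagate statistics to the next layer
embedding = np.concatenate(utterance_stats)  # serialized aggregation
print(embedding.shape)               # (48,) = 2 * D * L
```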

Selective Hearing through Lip-reading

1 code implementation14 Jun 2021 Zexu Pan, Ruijie Tao, Chenglin Xu, Haizhou Li

A speaker extraction algorithm emulates the human ability of selective attention to extract the target speaker's speech from a multi-talker scenario.

Lip Reading

Emotional Voice Conversion: Theory, Databases and ESD

no code implementations31 May 2021 Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li

In this paper, we first provide a review of the state-of-the-art emotional voice conversion research, and the existing emotional speech databases.

Voice Conversion

The Multi-speaker Multi-style Voice Cloning Challenge 2021

no code implementations5 Apr 2021 Qicong Xie, Xiaohai Tian, Guanghou Liu, Kun Song, Lei Xie, Zhiyong Wu, Hai Li, Song Shi, Haizhou Li, Fen Hong, Hui Bu, Xin Xu

The challenge consists of two tracks, namely few-shot track and one-shot track, where the participants are required to clone multiple target voices with 100 and 5 samples respectively.

Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

1 code implementation31 Mar 2021 Kun Zhou, Berrak Sisman, Haizhou Li

In stage 2, we perform emotion training with a limited amount of emotional speech data, to learn how to disentangle emotional style and linguistic information from the speech.

Voice Conversion

Target Speaker Verification with Selective Auditory Attention for Single and Multi-talker Speech

1 code implementation30 Mar 2021 Chenglin Xu, Wei Rao, Jibin Wu, Haizhou Li

Inspired by studies on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that is able to pay selective auditory attention to the target speaker.

Multi-Task Learning Speaker Verification

Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification

no code implementations15 Feb 2021 Bidisha Sharma, Maulik Madhavi, Haizhou Li

An intent classification system is usually implemented as a pipeline process, with a speech recognition module followed by text processing that classifies the intents.

Classification General Classification +6

VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

no code implementations3 Nov 2020 Kun Zhou, Berrak Sisman, Haizhou Li

Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity.

Voice Conversion

Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

2 code implementations28 Oct 2020 Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li

Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.

Speech Emotion Recognition Style Transfer +1

GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis

no code implementations23 Oct 2020 Rui Liu, Berrak Sisman, Haizhou Li

Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways.

Graph Attention Speech Synthesis +1

Muse: Multi-modal target speaker extraction with visual cues

no code implementations15 Oct 2020 Zexu Pan, Ruijie Tao, Chenglin Xu, Haizhou Li

A speaker extraction algorithm relies on a speech sample from the target speaker as the reference point to focus its attention.

Speaker-Utterance Dual Attention for Speaker and Utterance Verification

no code implementations20 Aug 2020 Tianchi Liu, Rohan Kumar Das, Maulik Madhavi, ShengMei Shen, Haizhou Li

The proposed SUDA features an attention mask mechanism to learn the interaction between the speaker and utterance information streams.

Speaker Verification

Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN

no code implementations11 Aug 2020 Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li

It relies on non-parallel training data from two different languages and is hence more challenging than mono-lingual voice conversion.

Voice Conversion

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

no code implementations11 Aug 2020 Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks.

Multi-Task Learning Speech Synthesis
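The multi-task objective described above, a spectrum reconstruction loss combined with a phrase-break prediction loss, can be sketched as a weighted sum. The binary break labels, the MSE/BCE pairing, and the weight `alpha` are assumptions for illustration, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical model outputs for one utterance of T frames.
T, n_mels = 20, 80
mel_pred = rng.normal(size=(T, n_mels))
mel_true = mel_pred + 0.05 * rng.normal(size=(T, n_mels))
# Phrase-break head: per-frame probability of a break (binary target).
break_prob = rng.uniform(0.01, 0.99, size=T)
break_true = (rng.uniform(size=T) < 0.2).astype(float)

mel_loss = np.mean((mel_pred - mel_true) ** 2)        # spectrum MSE
bce = -np.mean(break_true * np.log(break_prob)
               + (1 - break_true) * np.log(1 - break_prob))
alpha = 0.5          # task weighting is an assumption, not from the paper
loss = mel_loss + alpha * bce
```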

VAW-GAN for Singing Voice Conversion with Non-parallel Training Data

no code implementations10 Aug 2020 Junchen Lu, Kun Zhou, Berrak Sisman, Haizhou Li

We train an encoder to disentangle singer identity and singing prosody (F0 contour) from phonetic content.

Voice Conversion

Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network

no code implementations7 Jul 2020 Zihan Pan, Malu Zhang, Jibin Wu, Haizhou Li

Inspired by the mammalian auditory localization pathway, in this paper we propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments, and implement this algorithm in a real-time robotic system with a microphone array.

Progressive Tandem Learning for Pattern Recognition with Deep Spiking Neural Networks

no code implementations2 Jul 2020 Jibin Wu, Cheng-Lin Xu, Daquan Zhou, Haizhou Li, Kay Chen Tan

In this paper, we propose a novel ANN-to-SNN conversion and layer-wise learning framework for rapid and efficient pattern recognition, which is referred to as progressive tandem learning of deep SNNs.

Image Reconstruction Object Recognition +1

Modeling Code-Switch Languages Using Bilingual Parallel Corpus

no code implementations ACL 2020 Grandee Lee, Haizhou Li

A bilingual language model is expected to model the sequential dependency for words across languages, which is difficult due to the inherent lack of suitable training data as well as diverse syntactic structure across languages.

Bilingual Lexicon Induction Language Modelling +1

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

1 code implementation13 May 2020 Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li

We consider that there is a common code between speakers for emotional expression in a spoken language; therefore, a speaker-independent mapping between emotional states is possible.

Voice Conversion

SpEx+: A Complete Time Domain Speaker Extraction Network

no code implementations10 May 2020 Meng Ge, Cheng-Lin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

To eliminate such a mismatch, we propose a complete time-domain speaker extraction solution called SpEx+.

Audio and Speech Processing Sound

Time-domain speaker extraction network

no code implementations29 Apr 2020 Cheng-Lin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

The inaccuracy of phase estimation is inherent to frequency-domain processing and affects the quality of signal reconstruction.

Audio and Speech Processing Sound

SpEx: Multi-Scale Time Domain Speaker Extraction Network

1 code implementation17 Apr 2020 Cheng-Lin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

Inspired by Conv-TasNet, we propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra.

Multi-Task Learning Speech Quality

Rectified Linear Postsynaptic Potential Function for Backpropagation in Deep Spiking Neural Networks

no code implementations26 Mar 2020 Malu Zhang, Jiadong Wang, Burin Amornpaisannon, Zhixuan Zhang, VPK Miriyala, Ammar Belatreche, Hong Qu, Jibin Wu, Yansong Chua, Trevor E. Carlson, Haizhou Li

In STDBP algorithm, the timing of individual spikes is used to convey information (temporal coding), and learning (back-propagation) is performed based on spike timing in an event-driven manner.

Decision Making

WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss

no code implementations2 Feb 2020 Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

To address this problem, we propose a new training scheme for Tacotron-based TTS, referred to as WaveTTS, that has two loss functions: 1) a time-domain loss, denoted as the waveform loss, which measures the distortion between the natural and generated waveforms; and 2) a frequency-domain loss, which measures the Mel-scale acoustic feature loss between the natural and generated acoustic features.
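The two-term objective can be sketched directly: a waveform MSE plus an MSE between magnitude spectrograms. The paper uses Mel-scale acoustic features for the frequency-domain term; this sketch substitutes a plain STFT magnitude to stay self-contained, and the equal weighting is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

def stft_mag(x, n_fft=64, hop=16):
    """Magnitude spectrogram via a Hann-windowed short-time FFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

# A toy "natural" waveform and a slightly noisy "generated" one.
t = np.linspace(0, 1, 1600, endpoint=False)
natural = np.sin(2 * np.pi * 220 * t)
generated = natural + 0.01 * rng.normal(size=t.shape)

time_loss = np.mean((generated - natural) ** 2)               # waveform loss
freq_loss = np.mean((stft_mag(generated) - stft_mag(natural)) ** 2)
loss = time_loss + freq_loss   # equal weighting is an assumption
```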

Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

1 code implementation1 Feb 2020 Kun Zhou, Berrak Sisman, Haizhou Li

Many studies require parallel speech data between different emotional patterns, which is not practical in real life.

Voice Conversion

Deep Spiking Neural Networks for Large Vocabulary Automatic Speech Recognition

1 code implementation19 Nov 2019 Jibin Wu, Emre Yilmaz, Malu Zhang, Haizhou Li, Kay Chen Tan

Brain-inspired spiking neural networks (SNNs) closely mimic biological neural networks and can operate on low-power neuromorphic hardware with spike-based computation.

Automatic Speech Recognition Speech Recognition

Teacher-Student Training for Robust Tacotron-based TTS

no code implementations7 Nov 2019 Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao, Haizhou Li

We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder; this model serves as the teacher.

Knowledge Distillation

End-to-End Code-Switching ASR for Low-Resourced Language Pairs

no code implementations27 Sep 2019 Xianghu Yue, Grandee Lee, Emre Yilmaz, Fang Deng, Haizhou Li

In this work, we describe an E2E ASR pipeline for the recognition of CS speech in which a low-resourced language is mixed with a high-resourced language.

Automatic Speech Recognition Language Modelling +1

Automatic Lyrics Alignment and Transcription in Polyphonic Music: Does Background Music Help?

no code implementations23 Sep 2019 Chitralekha Gupta, Emre Yilmaz, Haizhou Li

Automatic lyrics alignment and transcription in polyphonic music are challenging tasks because the singing vocals are corrupted by the background music.

Audio and Speech Processing Sound

Neural Population Coding for Effective Temporal Classification

no code implementations12 Sep 2019 Zihan Pan, Jibin Wu, Yansong Chua, Malu Zhang, Haizhou Li

We show that, with population neural coding, the encoded patterns are linearly separable using a Support Vector Machine (SVM).

Classification General Classification
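Population coding with Gaussian receptive fields can make a problem linearly separable that is not separable in the raw input. The sketch below uses a simple perceptron readout in place of the paper's SVM to stay dependency-free; all sizes and tuning-curve parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def population_encode(x, centers, sigma=0.15):
    """Gaussian receptive-field population code for scalar inputs."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))

# A 1-D problem that is NOT linearly separable in the raw input:
# the positive class occupies the middle of the range.
x = rng.uniform(0.0, 1.0, size=200)
y = ((x > 0.35) & (x < 0.65)).astype(int)

centers = np.linspace(0.0, 1.0, 9)
phi = population_encode(x, centers)          # (200, 9) population code

# Perceptron readout on the population code (stand-in for the SVM).
w, b = np.zeros(phi.shape[1]), 0.0
for _ in range(200):
    for xi, yi in zip(phi, y):
        pred = int(w @ xi + b > 0)
        w += (yi - pred) * xi
        b += (yi - pred)
acc = np.mean((phi @ w + b > 0).astype(int) == y)
```

A linear readout on the raw scalar `x` cannot exceed chance-plus on this task, while the population-coded features admit a high-accuracy linear boundary.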

An efficient and perceptually motivated auditory neural encoding and decoding algorithm for spiking neural networks

no code implementations3 Sep 2019 Zihan Pan, Yansong Chua, Jibin Wu, Malu Zhang, Haizhou Li, Eliathamby Ambikairajah

The neural encoding scheme, that we call Biologically plausible Auditory Encoding (BAE), emulates the functions of the perceptual components of the human auditory system, that include the cochlear filter bank, the inner hair cells, auditory masking effects from psychoacoustic models, and the spike neural encoding by the auditory nerve.

Speech Recognition

A Tandem Learning Rule for Effective Training and Rapid Inference of Deep Spiking Neural Networks

no code implementations2 Jul 2019 Jibin Wu, Yansong Chua, Malu Zhang, Guoqi Li, Haizhou Li, Kay Chen Tan

Spiking neural networks (SNNs) represent the most prominent biologically inspired computing model for neuromorphic computing (NC) architectures.

Event-based vision

Acoustic Modeling for Automatic Lyrics-to-Audio Alignment

no code implementations25 Jun 2019 Chitralekha Gupta, Emre Yilmaz, Haizhou Li

In this work, we propose (1) using additional speech and music-informed features and (2) adapting the acoustic models trained on a large amount of solo singing vocals towards polyphonic music using a small amount of in-domain data.

Code-Switching Detection Using ASR-Generated Language Posteriors

no code implementations19 Jun 2019 Qinyi Wang, Emre Yilmaz, Adem Derinel, Haizhou Li

Code-switching (CS) detection refers to the automatic detection of language switches in code-mixed utterances.

Automatic Speech Recognition Speech Recognition

Large-Scale Speaker Diarization of Radio Broadcast Archives

no code implementations19 Jun 2019 Emre Yilmaz, Adem Derinel, Zhou Kun, Henk van den Heuvel, Niko Brummer, Haizhou Li, David A. van Leeuwen

This paper describes our initial efforts to build a large-scale speaker diarization (SD) and identification system on a recently digitized radio broadcast archive from the Netherlands, which comprises more than 6500 audio tapes with 3000 hours of Frisian-Dutch speech recorded between 1950 and 2016.

Speaker Diarization Speaker Identification

Multi-Graph Decoding for Code-Switching ASR

no code implementations18 Jun 2019 Emre Yilmaz, Samuel Cohen, Xianghu Yue, David van Leeuwen, Haizhou Li

This archive contains recordings with monolingual Frisian and Dutch speech segments as well as Frisian-Dutch CS speech, hence the recognition performance on monolingual segments is also vital for accurate transcriptions.

Automatic Speech Recognition Language Modelling +1

VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019

no code implementations27 May 2019 Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, Satoshi Nakamura

Our proposed approach significantly improved the intelligibility (in CER), the MOS, and discrimination ABX scores compared to the official ZeroSpeech 2019 baseline or even the topline.

Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet

no code implementations29 Mar 2019 Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, Junichi Yamagishi

We propose using an extended model architecture of Tacotron, a multi-source sequence-to-sequence model with a dual attention mechanism, as the shared model for both the TTS and VC tasks.

Speech Synthesis Voice Conversion

Deep Spiking Neural Network with Spike Count based Learning Rule

no code implementations15 Feb 2019 Jibin Wu, Yansong Chua, Malu Zhang, Qu Yang, Guoqi Li, Haizhou Li

Deep spiking neural networks (SNNs) support asynchronous event-driven computation and massive parallelism, and demonstrate great potential for improving the energy efficiency of their synchronous analog counterparts.

On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition

1 code implementation1 Nov 2018 Zhiping Zeng, Yerbolat Khassanov, Van Tung Pham, Hai-Hua Xu, Eng Siong Chng, Haizhou Li

Code-switching (CS) refers to a linguistic phenomenon where a speaker uses different languages in an utterance or between alternating utterances.

Data Augmentation Language Identification +2

Generative x-vectors for text-independent speaker verification

no code implementations17 Sep 2018 Longting Xu, Rohan Kumar Das, Emre Yilmaz, Jichen Yang, Haizhou Li

Speaker verification (SV) systems using deep neural network embeddings, the so-called x-vector systems, are becoming popular due to their performance, which is superior to that of i-vector systems.

Text-Independent Speaker Verification

Is Neuromorphic MNIST neuromorphic? Analyzing the discriminative power of neuromorphic datasets in the time domain

no code implementations3 Jul 2018 Laxmi R. Iyer, Yansong Chua, Haizhou Li

We also use this SNN for further experiments on N-MNIST to show that rate-based SNNs perform better and that precise spike timings are not important in N-MNIST.

Report of NEWS 2018 Named Entity Transliteration Shared Task

no code implementations WS 2018 Nancy Chen, Rafael E. Banchs, Min Zhang, Xiangyu Duan, Haizhou Li

This report presents the results from the Named Entity Transliteration Shared Task conducted as part of The Seventh Named Entities Workshop (NEWS 2018) held at ACL 2018 in Melbourne, Australia.

Information Retrieval Transliteration

Learning Acoustic Word Embeddings with Temporal Context for Query-by-Example Speech Search

no code implementations10 Jun 2018 Yougen Yuan, Cheung-Chi Leung, Lei Xie, Hongjie Chen, Bin Ma, Haizhou Li

We also find that it is important to have sufficient speech segment pairs to train the deep CNN for effective acoustic word embeddings.

Dynamic Time Warping Word Embeddings

A Multi-State Diagnosis and Prognosis Framework with Feature Learning for Tool Condition Monitoring

no code implementations30 Apr 2018 Chong Zhang, Geok Soon Hong, Jun-Hong Zhou, Kay Chen Tan, Haizhou Li, Huan Xu, Jihoon Hong, Hian-Leng Chan

For fault diagnosis, a cost-sensitive deep belief network (namely ECS-DBN) is applied to deal with the imbalanced data problem for tool state estimation.

Representation Learning

A Cost-Sensitive Deep Belief Network for Imbalanced Classification

no code implementations28 Apr 2018 Chong Zhang, Kay Chen Tan, Haizhou Li, Geok Soon Hong

Adaptive differential evolution is implemented as the optimization algorithm; it automatically updates its parameters without the need for prior domain knowledge.

Classification General Classification +1
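The adaptive scheme in the paper is not reproduced here, but the flavor of self-adapting differential evolution can be sketched with a classic DE/rand/1/bin loop whose per-individual F and CR are occasionally resampled and kept only when they produce an improvement (a jDE-style rule; the paper's actual mechanism may differ):

```python
import numpy as np

rng = np.random.default_rng(5)

def sphere(x):
    """Toy objective to minimize; stands in for the real training loss."""
    return float(np.sum(x ** 2))

dim, pop_size, gens = 5, 20, 150
pop = rng.uniform(-5, 5, size=(pop_size, dim))
fit = np.array([sphere(p) for p in pop])
F = np.full(pop_size, 0.5)    # per-individual mutation factors
CR = np.full(pop_size, 0.9)   # per-individual crossover rates

for _ in range(gens):
    for i in range(pop_size):
        # Occasionally resample this individual's own control parameters.
        Fi = rng.uniform(0.1, 1.0) if rng.random() < 0.1 else F[i]
        CRi = rng.uniform(0.0, 1.0) if rng.random() < 0.1 else CR[i]
        a, b, c = pop[rng.choice([j for j in range(pop_size) if j != i],
                                 size=3, replace=False)]
        mutant = a + Fi * (b - c)                 # DE/rand/1 mutation
        cross = rng.random(dim) < CRi             # binomial crossover
        cross[rng.integers(dim)] = True           # keep >= 1 mutant gene
        trial = np.where(cross, mutant, pop[i])
        f_trial = sphere(trial)
        if f_trial <= fit[i]:  # greedy selection; successful params survive
            pop[i], fit[i], F[i], CR[i] = trial, f_trial, Fi, CRi
```

Parameters that lead to accepted trials persist, so F and CR adapt to the landscape without being tuned by hand.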

Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework

4 code implementations6 Jul 2017 Shan Yang, Lei Xie, Xiao Chen, Xiaoyan Lou, Xuan Zhu, Dong-Yan Huang, Haizhou Li

In this paper, we aim at improving the performance of synthesized speech in statistical parametric speech synthesis (SPSS) based on a generative adversarial network (GAN).

Sound

Spoofing detection under noisy conditions: a preliminary investigation and an initial database

no code implementations9 Feb 2016 Xiaohai Tian, Zhizheng Wu, Xiong Xiao, Eng Siong Chng, Haizhou Li

To simulate the real-life scenarios, we perform a preliminary investigation of spoofing detection under additive noisy conditions, and also describe an initial database for this task.

Speaker Verification

Fantastic 4 system for NIST 2015 Language Recognition Evaluation

no code implementations5 Feb 2016 Kong Aik Lee, Ville Hautamäki, Anthony Larcher, Wei Rao, Hanwu Sun, Trung Hieu Nguyen, Guangsen Wang, Aleksandr Sizov, Ivan Kukanov, Amir Poorjam, Trung Ngo Trong, Xiong Xiao, Cheng-Lin Xu, Hai-Hua Xu, Bin Ma, Haizhou Li, Sylvain Meignier

This article describes the systems jointly submitted by the Institute for Infocomm Research (I²R), the Laboratoire d'Informatique de l'Université du Maine (LIUM), Nanyang Technological University (NTU) and the University of Eastern Finland (UEF) for the 2015 NIST Language Recognition Evaluation (LRE).
