Search Results for author: Zhong Meng

Found 38 papers, 3 papers with code

Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

no code implementations30 Mar 2022 Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) alongside multi-talker transcription with low latency.

Automatic Speech Recognition Speaker Diarization +1

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

no code implementations2 Feb 2022 Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR).

Automatic Speech Recognition
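As a rough illustration of the serialization scheme the abstract describes, here is a minimal Python sketch; the token timings, the two-speaker layout, and the "<cc>" channel-change symbol are assumptions drawn from my reading of the abstract, not code from the paper.

```python
# Hypothetical sketch of t-SOT label serialization for two overlapping
# speakers: flatten all tokens in chronological order and mark every
# switch between virtual channels with a special "<cc>" token.

def serialize_t_sot(utterances):
    """utterances: list of (channel, [(token, end_time), ...])."""
    stream = [
        (end_time, channel, token)
        for channel, tokens in utterances
        for token, end_time in tokens
    ]
    stream.sort()  # chronological order of token end times

    labels, prev_channel = [], None
    for _, channel, token in stream:
        if prev_channel is not None and channel != prev_channel:
            labels.append("<cc>")  # channel-change token
        labels.append(token)
        prev_channel = channel
    return labels

# Two partially overlapped utterances on virtual channels 0 and 1.
print(serialize_t_sot([
    (0, [("hello", 0.4), ("world", 0.8)]),
    (1, [("good", 0.6), ("morning", 1.0)]),
]))
# ['hello', '<cc>', 'good', '<cc>', 'world', '<cc>', 'morning']
```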

Continuous Speech Separation with Recurrent Selective Attention Network

no code implementations28 Oct 2021 Yixuan Zhang, Zhuo Chen, Jian Wu, Takuya Yoshioka, Peidong Wang, Zhong Meng, Jinyu Li

In this paper, we propose to apply recurrent selective attention network (RSAN) to CSS, which generates a variable number of output channels based on active speaker counting.

Speech Recognition Speech Separation

Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR

no code implementations7 Oct 2021 Naoyuki Kanda, Xiong Xiao, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

Similar to the target-speaker voice activity detection (TS-VAD)-based diarization method, the E2E SA-ASR model is applied to estimate the speech activity of each speaker, with the advantages of (i) handling an unlimited number of speakers, (ii) leveraging linguistic information for speaker diarization, and (iii) simultaneously generating speaker-attributed transcriptions.

Action Detection Activity Detection +3
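To make the transcribe-then-diarize idea concrete, here is a minimal, hypothetical sketch of turning speaker-attributed token timestamps (as an E2E SA-ASR model could provide) into diarization segments; the data layout and merging threshold are my own assumptions, not the paper's recipe.

```python
# Merge consecutive tokens of the same speaker into diarization segments.

def tokens_to_segments(tokens, max_gap=0.5):
    """tokens: list of (speaker, start, end), sorted by start time."""
    segments = []  # (speaker, start, end)
    for spk, start, end in tokens:
        # Extend the speaker's previous segment if the pause is short.
        if segments and segments[-1][0] == spk and start - segments[-1][2] <= max_gap:
            segments[-1] = (spk, segments[-1][1], end)
        else:
            segments.append((spk, start, end))
    return segments

print(tokens_to_segments([
    ("A", 0.0, 0.3), ("A", 0.35, 0.7), ("B", 0.5, 0.9), ("A", 2.0, 2.4),
]))
# [('A', 0.0, 0.7), ('B', 0.5, 0.9), ('A', 2.0, 2.4)]
```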

Factorized Neural Transducer for Efficient Language Model Adaptation

no code implementations27 Sep 2021 Xie Chen, Zhong Meng, Sarangarajan Parthasarathy, Jinyu Li

In recent years, end-to-end (E2E) based automatic speech recognition (ASR) systems have achieved great success due to their simplicity and promising performance.

Automatic Speech Recognition

A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio

no code implementations6 Jul 2021 Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

Our evaluation on the AMI meeting corpus reveals that, after fine-tuning with a small amount of real data, the joint system performs 8.9-29.9% better in accuracy than the best modular system, while the modular system performs better before such fine-tuning.

Automatic Speech Recognition Representation Learning +2

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

no code implementations4 Jun 2021 Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong

In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weight tuning during inference.

Speech Recognition
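A minimal PyTorch sketch of MWER training with the LM score folded into the hypothesis posterior, following my reading of the abstract; the tensor names and the fixed lm_weight are illustrative assumptions.

```python
import torch

def mwer_with_lm_fusion(e2e_logprobs, lm_logprobs, word_errors, lm_weight=0.3):
    """All inputs are (n_best,) tensors for one utterance's n-best list."""
    fused = e2e_logprobs + lm_weight * lm_logprobs   # shallow-fusion score
    posterior = torch.softmax(fused, dim=-1)         # renormalize over n-best
    # Expected word errors, with the mean subtracted as a baseline.
    relative_errors = word_errors - word_errors.mean()
    return torch.sum(posterior * relative_errors)

loss = mwer_with_lm_fusion(
    e2e_logprobs=torch.tensor([-3.2, -4.1, -5.0], requires_grad=True),
    lm_logprobs=torch.tensor([-9.0, -7.5, -8.0]),
    word_errors=torch.tensor([1.0, 0.0, 2.0]),
)
loss.backward()
```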

End-to-End Speaker-Attributed ASR with Transformer

no code implementations5 Apr 2021 Naoyuki Kanda, Guoli Ye, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio.

Automatic Speech Recognition Speaker Identification

Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone

no code implementations31 Mar 2021 Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR).

Automatic Speech Recognition

Continuous Speech Separation with Ad Hoc Microphone Arrays

no code implementations3 Mar 2021 Dongmei Wang, Takuya Yoshioka, Zhuo Chen, Xiaofei Wang, Tianyan Zhou, Zhong Meng

Prior studies show that, with a spatial-temporal interleaving structure, neural networks can efficiently utilize the multi-channel signals of the ad hoc array.

Speech Recognition Speech Separation

Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition

no code implementations2 Feb 2021 Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, Yifan Gong

The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method.

Automatic Speech Recognition

Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

no code implementations3 Nov 2020 Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong

External language model (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR), which has no clear division between acoustic and language models.

Automatic Speech Recognition
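For both ILME papers above, the inference-time scoring can be summarized with a small hypothetical sketch: shallow fusion with the estimated internal LM score subtracted. The weight values below are placeholders, not the papers' tuned settings.

```python
# Per-hypothesis fused score for ILME-based LM integration.

def ilme_score(e2e_logprob, ext_lm_logprob, internal_lm_logprob,
               ext_weight=0.6, ilm_weight=0.3):
    return e2e_logprob + ext_weight * ext_lm_logprob - ilm_weight * internal_lm_logprob

# Rescore two hypotheses and keep the better one.
hyps = {"hyp_a": (-12.0, -20.0, -18.0), "hyp_b": (-12.5, -16.0, -19.0)}
best = max(hyps, key=lambda h: ilme_score(*hyps[h]))  # -> "hyp_b"
```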

Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR

1 code implementation3 Nov 2020 Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo Chen, Takuya Yoshioka

Recently, an end-to-end speaker-attributed automatic speech recognition (E2E SA-ASR) model was proposed as a joint model of speaker counting, speech recognition and speaker identification for monaural overlapped speech.

Automatic Speech Recognition Speaker Identification

On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer

no code implementations23 Oct 2020 Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, Yifan Gong

Hybrid Autoregressive Transducer (HAT) is a recently proposed end-to-end acoustic model that extends the standard Recurrent Neural Network Transducer (RNN-T) for the purpose of external language model (LM) fusion.

Speech Recognition

Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings

1 code implementation11 Aug 2020 Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

However, the model required prior knowledge of speaker profiles to perform speaker identification, which significantly limited the application of the model.

Automatic Speech Recognition Speaker Identification

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

no code implementations30 Jul 2020 Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, Yifan Gong

Because of its streaming nature, recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition.

Automatic Speech Recognition

Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers

no code implementations19 Jun 2020 Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka

We propose an end-to-end speaker-attributed automatic speech recognition model that unifies speaker counting, speech recognition, and speaker identification on monaural overlapped speech.

Automatic Speech Recognition Speaker Identification

L-Vector: Neural Label Embedding for Domain Adaptation

no code implementations25 Apr 2020 Zhong Meng, Hu Hu, Jinyu Li, Changliang Liu, Yan Huang, Yifan Gong, Chin-Hui Lee

We propose a novel neural label embedding (NLE) scheme for the domain adaptation of a deep neural network (DNN) acoustic model with unpaired data samples from source and target domains.

Domain Adaptation
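A minimal PyTorch sketch of training with per-label soft targets, in the spirit of the l-vector idea the abstract describes; the lookup-table layout and loss form are my assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

num_senones, batch = 4, 3
# One learned soft-target distribution (l-vector) per senone label,
# assumed here to have been trained on source-domain data beforehand.
l_vectors = F.softmax(torch.randn(num_senones, num_senones), dim=-1)

logits = torch.randn(batch, num_senones, requires_grad=True)  # target-domain DNN output
labels = torch.tensor([0, 2, 1])                              # senone alignments

# Cross entropy against the label's l-vector instead of a one-hot target.
soft_targets = l_vectors[labels]
loss = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
loss.backward()
```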

Active Voice Authentication

no code implementations25 Apr 2020 Zhong Meng, M Umair Bin Altaf, Biing-Hwang Juang

In our off-line evaluation on this dataset, the system achieves average window-based equal error rates of 3-4% depending on the model configuration, which is remarkable considering that only 1 second of voice data is used to make each authentication decision.

Speaker Verification

High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

no code implementations17 Mar 2020 Jinyu Li, Rui Zhao, Eric Sun, Jeremy H. M. Wong, Amit Das, Zhong Meng, Yifan Gong

While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved.

Automatic Speech Recognition

Continuous speech separation: dataset and analysis

1 code implementation30 Jan 2020 Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, Xiong Xiao, Jinyu Li

In this paper, we define continuous speech separation (CSS) as the task of generating a set of non-overlapped speech signals from a continuous audio stream that contains multiple utterances which are partially overlapped to a varying degree.

Automatic Speech Recognition Speech Separation

Character-Aware Attention-Based End-to-End Speech Recognition

no code implementations6 Jan 2020 Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong

However, as one input to the decoder recurrent neural network (RNN), each WSU embedding is learned independently through context and acoustic information in a purely data-driven fashion.

Speech Recognition

Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition

no code implementations6 Jan 2020 Zhong Meng, Jinyu Li, Yashesh Gaur, Yifan Gong

In this work, we extend the T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: teacher's token posteriors as soft labels and one-best predictions as decoder guidance.

Speech Recognition Transfer Learning +1
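A small PyTorch sketch of the first level of knowledge transfer described above: the student matches the teacher's token posteriors via KL divergence. Shapes and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

batch, steps, vocab = 2, 5, 100
teacher_logits = torch.randn(batch, steps, vocab)                   # frozen teacher
student_logits = torch.randn(batch, steps, vocab, requires_grad=True)

# KL(teacher || student), averaged over the batch.
loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)
loss.backward()
```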

Speaker Adaptation for Attention-Based End-to-End Speech Recognition

no code implementations9 Nov 2019 Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong

We propose three regularization-based speaker adaptation approaches to adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition.

Automatic Speech Recognition Multi-Task Learning

Adversarial Speaker Verification

no code implementations29 Apr 2019 Zhong Meng, Yong Zhao, Jinyu Li, Yifan Gong

The use of deep networks to extract embeddings for speaker recognition has proven successful.

General Classification Speaker Recognition +1

Adversarial Speaker Adaptation

no code implementations29 Apr 2019 Zhong Meng, Jinyu Li, Yifan Gong

We propose a novel adversarial speaker adaptation (ASA) scheme, in which adversarial learning is applied to regularize the distribution of deep hidden features in a speaker-dependent (SD) deep neural network (DNN) acoustic model to be close to that of a fixed speaker-independent (SI) DNN acoustic model during adaptation.

Automatic Speech Recognition

Conditional Teacher-Student Learning

no code implementations28 Apr 2019 Zhong Meng, Jinyu Li, Yong Zhao, Yifan Gong

To overcome this problem, we propose a conditional T/S learning scheme, in which a "smart" student model selectively chooses to learn from either the teacher model or the ground truth labels conditioned on whether the teacher can correctly predict the ground truth.

Domain Adaptation Model Compression
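A hypothetical per-sample version of the selection rule the abstract describes: learn from the teacher only where the teacher predicts the ground truth correctly, otherwise from the hard label. The argmax correctness check is my proxy for "the teacher can correctly predict the ground truth".

```python
import torch
import torch.nn.functional as F

def conditional_ts_loss(student_logits, teacher_logits, labels):
    teacher_correct = teacher_logits.argmax(dim=-1).eq(labels)    # (batch,)
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="none").sum(dim=-1)                   # per-sample KL
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    return torch.where(teacher_correct, kl, ce).mean()

loss = conditional_ts_loss(
    student_logits=torch.randn(8, 10, requires_grad=True),
    teacher_logits=torch.randn(8, 10),
    labels=torch.randint(0, 10, (8,)),
)
loss.backward()
```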

Attentive Adversarial Learning for Domain-Invariant Training

no code implementations28 Apr 2019 Zhong Meng, Jinyu Li, Yifan Gong

Adversarial domain-invariant training (ADIT) proves to be effective in suppressing the effects of domain variability in acoustic modeling and has led to improved performance in automatic speech recognition (ASR).

Automatic Speech Recognition

Adversarial Feature-Mapping for Speech Enhancement

no code implementations6 Sep 2018 Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang Juang

To achieve better performance on the ASR task, senone-aware (SA) AFM is further proposed, in which an acoustic model network is jointly trained with the feature-mapping and discriminator networks to optimize the senone classification loss in addition to the AFM losses.

Speech Enhancement

Cycle-Consistent Speech Enhancement

no code implementations6 Sep 2018 Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang Juang

In this paper, we propose cycle-consistent speech enhancement (CSE), in which an additional inverse mapping network is introduced to reconstruct the noisy features from the enhanced ones.

Multi-Task Learning Speech Enhancement
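A rough PyTorch sketch of the cycle-consistency term described above: an inverse network reconstructs noisy features from enhanced ones. The network definitions and loss weight are placeholder assumptions.

```python
import torch
import torch.nn as nn

feat_dim = 80
enhance = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
inverse = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

noisy = torch.randn(16, feat_dim)   # noisy input features
clean = torch.randn(16, feat_dim)   # parallel clean targets

enhanced = enhance(noisy)
reconstructed = inverse(enhanced)   # map enhanced features back to noisy

loss = nn.functional.mse_loss(enhanced, clean) \
     + 0.1 * nn.functional.mse_loss(reconstructed, noisy)  # cycle term
loss.backward()
```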

Adversarial Teacher-Student Learning for Unsupervised Domain Adaptation

no code implementations2 Apr 2018 Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang Juang

In this method, a student acoustic model and a condition classifier are jointly optimized to minimize the Kullback-Leibler divergence between the output distributions of the teacher and student models, and simultaneously, to min-maximize the condition classification loss.

Transfer Learning Unsupervised Domain Adaptation

Speaker-Invariant Training via Adversarial Learning

no code implementations2 Apr 2018 Zhong Meng, Jinyu Li, Zhuo Chen, Yong Zhao, Vadim Mazalov, Yifan Gong, Biing-Hwang Juang

We propose a novel adversarial multi-task learning scheme, aiming at actively curtailing the inter-talker feature variability while maximizing its senone discriminability so as to enhance the performance of a deep neural network (DNN) based ASR system.

General Classification Multi-Task Learning
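A compact gradient-reversal layer, the standard building block for adversarial multi-task schemes like the one sketched above; this is my own minimal PyTorch version, not code from the paper.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Identity in the forward pass, negated (scaled) gradient backward.
        return -ctx.lam * grad_output, None

features = torch.randn(4, 32, requires_grad=True)  # shared hidden features
speaker_logits = torch.nn.Linear(32, 10)(GradReverse.apply(features, 1.0))
# Training the speaker classifier through this layer pushes the shared
# features toward speaker invariance, while the senone branch (not shown)
# keeps them discriminative.
```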

Unsupervised Adaptation with Domain Separation Networks for Robust Speech Recognition

no code implementations21 Nov 2017 Zhong Meng, Zhuo Chen, Vadim Mazalov, Jinyu Li, Yifan Gong

Unsupervised domain adaptation of speech signals aims at adapting a well-trained source-domain acoustic model to unlabeled data from the target domain.

Automatic Speech Recognition General Classification +2

Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

no code implementations21 Nov 2017 Zhong Meng, Shinji Watanabe, John R. Hershey, Hakan Erdogan

Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients.

Robust Speech Recognition
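A minimal filter-and-sum sketch of the idea above: a network predicts per-channel filter coefficients that are applied to the multichannel input. The filter length and shapes are illustrative assumptions; the randomly drawn coefficients stand in for what the LSTM beamforming network would predict.

```python
import torch
import torch.nn.functional as F

channels, taps, samples = 4, 17, 1600
x = torch.randn(1, channels, samples)   # one frame of multichannel audio

# Stand-in for the predicted coefficients (one FIR filter per channel);
# in the paper these come from the adaptive beamforming LSTM, assisted
# by hidden units of the acoustic model.
filters = torch.randn(channels, 1, taps)

# Filter each channel independently (groups=channels), then sum across
# channels: the classic filter-and-sum beamformer.
filtered = F.conv1d(x, filters, groups=channels, padding=taps // 2)
beamformed = filtered.sum(dim=1)        # (1, samples) enhanced signal
```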
