Search Results for author: Daniel Povey

Found 37 papers, 24 papers with code

On Speaker Attribution with SURT

1 code implementation • 28 Jan 2024 • Desh Raj, Matthew Wiesner, Matthew Maciejewski, Leibny Paola Garcia-Perera, Daniel Povey, Sanjeev Khudanpur

The Streaming Unmixing and Recognition Transducer (SURT) has recently become a popular framework for continuous, streaming, multi-talker speech recognition (ASR).

Speech Recognition

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

1 code implementation • 15 Sep 2023 • Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, Daniel Povey

In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox.

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

1 code implementation • 14 Sep 2023 • Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen

Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques.

Self-Supervised Learning • Speech Recognition +2

PromptASR for contextualized ASR with controllable style

2 code implementations • 14 Sep 2023 • Xiaoyu Yang, Wei Kang, Zengwei Yao, Yifan Yang, Liyong Guo, Fangjun Kuang, Long Lin, Daniel Povey

An additional style prompt can be given to the text encoder to guide the ASR system to output different styles of transcriptions.

Automatic Speech Recognition • Speech Recognition +1

Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition

no code implementations • 12 Aug 2023 • Han Zhu, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan

Firstly, a generalized CTC loss function is introduced to handle noisy pseudo-labels by accepting alternative tokens in the positions of incorrect tokens.

Automatic Speech Recognition • Speech Recognition +1

SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition

1 code implementation • 18 Jun 2023 • Desh Raj, Daniel Povey, Sanjeev Khudanpur

The Streaming Unmixing and Recognition Transducer (SURT) model was proposed recently as an end-to-end approach for continuous, streaming, multi-talker speech recognition (ASR).

Domain Adaptation • Speech Recognition +1

Blank-regularized CTC for Frame Skipping in Neural Transducer

1 code implementation • 19 May 2023 • Yifan Yang, Xiaoyu Yang, Liyong Guo, Zengwei Yao, Wei Kang, Fangjun Kuang, Long Lin, Xie Chen, Daniel Povey

Neural Transducer and connectionist temporal classification (CTC) are popular end-to-end automatic speech recognition systems.

Automatic Speech Recognition • Speech Recognition +1

GPU-accelerated Guided Source Separation for Meeting Transcription

2 code implementations • 10 Dec 2022 • Desh Raj, Daniel Povey, Sanjeev Khudanpur

In this paper, we describe our improved implementation of GSS that leverages the power of modern GPU-based pipelines, including batched processing of frequencies and segments, to provide 300x speed-up over CPU-based inference.

Blind Source Separation • Target Speaker Extraction
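The batching idea behind the speed-up can be illustrated with a toy sketch. This is not the actual GSS implementation; the array shapes and names here are assumptions purely for illustration. The point is that applying per-frequency weights to all frequency bins in one vectorized operation maps far better onto a GPU than a frequency-by-frequency loop.

```python
import numpy as np

# Toy illustration of batched per-frequency processing: apply one
# beamforming weight vector per frequency bin to a multi-channel
# spectrogram, first with a Python loop, then in a single batched op.
rng = np.random.default_rng(0)
F, C, T = 257, 8, 100                  # frequency bins, channels, frames
stft = rng.standard_normal((F, C, T))  # multi-channel spectrogram (real-valued for simplicity)
weights = rng.standard_normal((F, C))  # one weight vector per frequency

# Frequency-by-frequency loop (how a CPU implementation might iterate)
out_loop = np.stack([weights[f] @ stft[f] for f in range(F)])

# All frequencies at once (the kind of batching that maps well to GPUs)
out_batched = np.einsum("fc,fct->ft", weights, stft)

assert np.allclose(out_loop, out_batched)
print(out_batched.shape)  # (257, 100)
```

The two computations are numerically identical; only the second exposes all F × T output elements to the hardware as one kernel, which is where the reported speed-up comes from.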

Fast and parallel decoding for transducer

1 code implementation • 31 Oct 2022 • Wei Kang, Liyong Guo, Fangjun Kuang, Long Lin, Mingshuang Luo, Zengwei Yao, Xiaoyu Yang, Piotr Żelasko, Daniel Povey

In this work, we introduce a constrained version of transducer loss to learn strictly monotonic alignments between the sequences; we also improve the standard greedy search and beam search algorithms by limiting the number of symbols that can be emitted per time step in transducer decoding, making it more efficient to decode in parallel with batches.

Speech Recognition
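The symbol-limiting idea from the abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: `joiner` stands in for a trained joint network, and the token ids and shapes are assumptions.

```python
import numpy as np

BLANK = 0

def greedy_decode(joiner, encoder_out, max_sym_per_frame=1):
    """Greedy transducer search with a cap on non-blank symbols per frame.

    encoder_out: (T, D) array of encoder frames; returns a list of token ids.
    """
    hyp = []
    for t in range(encoder_out.shape[0]):
        emitted = 0
        while emitted < max_sym_per_frame:
            logits = joiner(encoder_out[t], hyp)
            token = int(np.argmax(logits))
            if token == BLANK:
                break              # blank: move on to the next frame
            hyp.append(token)      # non-blank: emit and stay on this frame
            emitted += 1
    return hyp

# Dummy joiner for demonstration: emits token 2 on "positive" frames.
def dummy_joiner(frame, hyp):
    logits = np.full(5, -1.0)
    logits[2 if frame[0] > 0 else BLANK] = 1.0
    return logits

enc = np.array([[1.0], [-1.0], [2.0]])
print(greedy_decode(dummy_joiner, enc, max_sym_per_frame=1))  # [2, 2]
```

With the cap in place, the decoder makes at most T × max_sym_per_frame joiner calls per utterance, and every hypothesis in a batch advances through frames in lockstep, which is what makes parallel batched decoding straightforward.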

Delay-penalized transducer for low-latency streaming ASR

1 code implementation • 31 Oct 2022 • Wei Kang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Xiaoyu Yang, Long Lin, Piotr Żelasko, Daniel Povey

In streaming automatic speech recognition (ASR), it is desirable to reduce latency as much as possible while having minimum impact on recognition accuracy.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +1

Pruned RNN-T for fast, memory-efficient ASR training

no code implementations • 23 Jun 2022 • Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, Daniel Povey

The RNN-Transducer (RNN-T) framework for speech recognition has been growing in popularity, particularly for deployed real-time ASR systems, because it combines high accuracy with naturally streaming recognition.

Speech Recognition

GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

2 code implementations • 13 Jun 2021 • Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, Zhiyong Yan

This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training.

Sentence • Speech Recognition +1

speechocean762: An Open-Source Non-native English Speech Corpus For Pronunciation Assessment

2 code implementations • 3 Apr 2021 • Junbo Zhang, Zhiwen Zhang, Yongqing Wang, Zhiyong Yan, Qiong Song, YuKai Huang, Ke Li, Daniel Povey, Yujun Wang

This paper introduces a new open-source speech corpus named "speechocean762" designed for pronunciation assessment use, consisting of 5000 English utterances from 250 non-native speakers, where half of the speakers are children.

Phone-level Pronunciation Scoring • Sentence +1

A Parallelizable Lattice Rescoring Strategy with Neural Language Models

1 code implementation • 8 Mar 2021 • Ke Li, Daniel Povey, Sanjeev Khudanpur

This paper proposes a parallel computation strategy and a posterior-based lattice expansion algorithm for efficient lattice rescoring with neural language models (LMs) for automatic speech recognition.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +1

Wake Word Detection with Streaming Transformers

no code implementations • 8 Feb 2021 • Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

Modern wake word detection systems usually rely on neural networks for acoustic modeling.

DOVER-Lap: A Method for Combining Overlap-aware Diarization Outputs

1 code implementation • 3 Nov 2020 • Desh Raj, Leibny Paola Garcia-Perera, Zili Huang, Shinji Watanabe, Daniel Povey, Andreas Stolcke, Sanjeev Khudanpur

Several advances have been made recently towards handling overlapping speech for speaker diarization.

Audio and Speech Processing • Sound

PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR

1 code implementation • 20 May 2020 • Yiwen Shao, Yiming Wang, Daniel Povey, Sanjeev Khudanpur

We present PyChain, a fully parallelized PyTorch implementation of end-to-end lattice-free maximum mutual information (LF-MMI) training for the so-called \emph{chain models} in the Kaldi automatic speech recognition (ASR) toolkit.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +1

Wake Word Detection with Alignment-Free Lattice-Free MMI

1 code implementation • 17 May 2020 • Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input.

Speaker Diarization with Region Proposal Network

1 code implementation • 14 Feb 2020 • Zili Huang, Shinji Watanabe, Yusuke Fujita, Paola Garcia, Yiwen Shao, Daniel Povey, Sanjeev Khudanpur

Speaker diarization is an important pre-processing step for many speech applications, and it aims to solve the "who spoke when" problem.

Region Proposal • Speaker Diarization +1

GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition

1 code implementation • 22 Oct 2019 • Hugo Braun, Justin Luitjens, Ryan Leary, Tim Kaldewey, Daniel Povey

We present an optimized weighted finite-state transducer (WFST) decoder capable of online streaming and offline batch processing of audio using Graphics Processing Units (GPUs).

Speech Recognition

Probing the Information Encoded in X-vectors

no code implementations • 13 Sep 2019 • Desh Raj, David Snyder, Daniel Povey, Sanjeev Khudanpur

Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks.

Data Augmentation • Sentence +3

End-to-end speech recognition using lattice-free MMI

no code implementations • Interspeech 2018 • Hossein Hadian, Hossein Sameti, Daniel Povey, Sanjeev Khudanpur

We present our work on end-to-end training of acoustic models using the lattice-free maximum mutual information (LF-MMI) objective function in the context of hidden Markov models.

Speech Recognition

Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks

1 code implementation • Interspeech 2018 • Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, Sanjeev Khudanpur

Time Delay Neural Networks (TDNNs), also known as one-dimensional Convolutional Neural Networks (1-d CNNs), are an efficient and well-performing neural network architecture for speech recognition.

Speech Recognition

A GPU-based WFST Decoder with Exact Lattice Generation

no code implementations • 9 Apr 2018 • Zhehuai Chen, Justin Luitjens, Hainan Xu, Yiming Wang, Daniel Povey, Sanjeev Khudanpur

We describe initial work on an extension of the Kaldi toolkit that supports weighted finite-state transducer (WFST) decoding on Graphics Processing Units (GPUs).

Scheduling

Purely sequence-trained neural networks for ASR based on lattice-free MMI

no code implementations • Interspeech 2016 • Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, Sanjeev Khudanpur

Models trained with LF-MMI provide a relative word error rate reduction of ∼11.5% over those trained with the cross-entropy objective function, and ∼8% over those trained with cross-entropy and sMBR objective functions.

Language Modelling • Speech Recognition

MUSAN: A Music, Speech, and Noise Corpus

2 code implementations • 28 Oct 2015 • David Snyder, Guoguo Chen, Daniel Povey

This report introduces a new corpus of music, speech, and noise.

Sound

Parallel training of DNNs with Natural Gradient and Parameter Averaging

1 code implementation • 27 Oct 2014 • Daniel Povey, Xiaohui Zhang, Sanjeev Khudanpur

However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.

Speech Recognition
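The periodic-averaging part of the method can be sketched in a few lines. The natural-gradient preconditioning is omitted here, and the toy quadratic objective and all variable names are assumptions for illustration only: each worker runs plain SGD on its own data shard, and after every round the workers' parameters are averaged and redistributed.

```python
import numpy as np

def sgd_steps(params, data, lr=0.1):
    """One pass of SGD on a worker's shard of data points."""
    for x in data:
        params = params - lr * (params - x)  # gradient of 0.5 * ||params - x||^2
    return params

rng = np.random.default_rng(0)
shards = [rng.standard_normal((8, 4)) + 3.0 for _ in range(4)]  # 4 workers' shards
params = np.zeros(4)

for _ in range(50):                                      # outer rounds
    # Each worker starts from the same parameters and trains locally...
    local = [sgd_steps(params.copy(), shard) for shard in shards]
    # ...then the results are averaged (periodic parameter averaging).
    params = np.mean(local, axis=0)

# params ends up near the optimum of the combined objective (~3.0 per dim)
```

On this toy convex problem averaging alone converges; the paper's observation is that on non-convex neural-network objectives, plain averaging tends to break down, and the natural-gradient preconditioning (not shown) is what makes the periodic-averaging scheme work well in practice.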
