Search Results for author: Yangyang Shi

Found 40 papers, 6 papers with code

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

no code implementations • 22 Feb 2024 • Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra

The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% over MobileLLM 125M/350M.

Not All Weights Are Created Equal: Enhancing Energy Efficiency in On-Device Streaming Speech Recognition

no code implementations • 20 Feb 2024 • Yang Li, Yuan Shangguan, Yuhao Wang, Liangzhen Lai, Ernie Chang, Changsheng Zhao, Yangyang Shi, Vikas Chandra

This study delves into how weight parameters in speech recognition models influence the overall power consumption of these models.

speech-recognition Speech Recognition

FADI-AEC: Fast Score Based Diffusion Model Guided by Far-end Signal for Acoustic Echo Cancellation

no code implementations • 8 Jan 2024 • Yang Liu, Li Wan, Yun Li, Yiteng Huang, Ming Sun, James Luan, Yangyang Shi, Xin Lei

Despite the potential of diffusion models in speech enhancement, their deployment in Acoustic Echo Cancellation (AEC) has been restricted.

Acoustic echo cancellation Speech Enhancement

On The Open Prompt Challenge In Conditional Audio Generation

no code implementations • 1 Nov 2023 • Ernie Chang, Sidd Srinivasan, Mahi Luthra, Pin-Jie Lin, Varun Nagaraja, Forrest Iandola, Zechun Liu, Zhaoheng Ni, Changsheng Zhao, Yangyang Shi, Vikas Chandra

Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text.

Audio Generation

In-Context Prompt Editing For Conditional Audio Generation

no code implementations • 1 Nov 2023 • Ernie Chang, Pin-Jie Lin, Yang Li, Sidd Srinivasan, Gael Le Lan, David Kant, Yangyang Shi, Forrest Iandola, Vikas Chandra

We show that the framework enhanced the audio quality across the set of collected user prompts, which were edited with reference to the training captions as exemplars.

Audio Generation Retrieval

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

1 code implementation • 27 Oct 2023 • Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

TorchAudio is an open-source audio and speech processing library built for PyTorch.

Self-Supervised Learning Speech Enhancement +2

2,379

Exploring Speech Enhancement for Low-resource Speech Synthesis

no code implementations • 19 Sep 2023 • Zhaoheng Ni, Sravya Popuri, Ning Dong, Kohei Saijo, Xiaohui Zhang, Gael Le Lan, Yangyang Shi, Vikas Chandra, Changhan Wang

High-quality and intelligible speech is essential to text-to-speech (TTS) model training; however, obtaining high-quality data for low-resource languages is challenging and expensive.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

FoleyGen: Visually-Guided Audio Generation

no code implementations • 19 Sep 2023 • Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

A prevalent problem in V2A generation is the misalignment of generated audio with the visible actions in the video.

Audio Generation Language Modelling

Stack-and-Delay: a new codebook pattern for music generation

no code implementations • 15 Sep 2023 • Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, Vikas Chandra

In language modeling based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either in an auto-regressive manner or in parallel, depending on the codebook patterns.

Language Modelling Music Generation

Enhance audio generation controllability through representation similarity regularization

no code implementations • 15 Sep 2023 • Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra

This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training.

Audio Generation Language Modelling +2

Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition

no code implementations • 14 Sep 2023 • Yang Li, Liangzhen Lai, Yuan Shangguan, Forrest N. Iandola, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, constituting a substantial portion of the model size and contributing significantly to computation, memory, and power usage.

speech-recognition Speech Recognition

DISGO: Automatic End-to-End Evaluation for Scene Text OCR

no code implementations • 25 Aug 2023 • Mei-Yuh Hwang, Yangyang Shi, Ankit Ramchandani, Guan Pang, Praveen Krishnan, Lucas Kabela, Frank Seide, Samyak Datta, Jun Liu

This paper discusses the challenges of optical character recognition (OCR) on natural scenes, which is harder than OCR on documents due to the wild content and various image backgrounds.

Machine Translation Optical Character Recognition +2

Revisiting Sample Size Determination in Natural Language Understanding

1 code implementation • 1 Jul 2023 • Ernie Chang, Muhammad Hassan Rashid, Pin-Jie Lin, Changsheng Zhao, Vera Demberg, Yangyang Shi, Vikas Chandra

Knowing exactly how many data points need to be labeled to achieve a certain model performance is a hugely beneficial step towards reducing the overall budgets for annotation.

Active Learning Natural Language Understanding

Binary and Ternary Natural Language Generation

1 code implementation • 2 Jun 2023 • Zechun Liu, Barlas Oguz, Aasish Pappu, Yangyang Shi, Raghuraman Krishnamoorthi

For machine translation, we achieved BLEU scores of 21.7 and 17.6 on the WMT16 En-Ro benchmark, compared with a full precision mBART model score of 26.8.

Machine Translation Quantization +2

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

no code implementations • 29 May 2023 • Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra

Several post-training quantization methods have been applied to large language models (LLMs), and have been shown to perform well down to 8-bits.

Data Free Quantization

Multi-Head State Space Model for Speech Recognition

no code implementations • 21 May 2023 • Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, Mark J. F. Gales

State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches.

Ranked #8 on Speech Recognition on LibriSpeech test-clean

Language Modelling speech-recognition +1

Improving Fast-slow Encoder based Transducer with Streaming Deliberation

no code implementations • 15 Dec 2022 • Ke Li, Jay Mahadeokar, Jinxi Guo, Yangyang Shi, Gil Keren, Ozlem Kalinli, Michael L. Seltzer, Duc Le

Experiments on Librispeech and in-house data show relative WER reductions (WERRs) from 3% to 5% with a slight increase in model size and negligible extra token emission latency compared with the fast-slow encoder based transducer.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

LiCo-Net: Linearized Convolution Network for Hardware-efficient Keyword Spotting

no code implementations • 9 Nov 2022 • Haichuan Yang, Zhaojun Yang, Li Wan, Biqiao Zhang, Yangyang Shi, Yiteng Huang, Ivaylo Enchev, Limin Tang, Raziel Alvarez, Ming Sun, Xin Lei, Raghuraman Krishnamoorthi, Vikas Chandra

This paper proposes a hardware-efficient architecture, Linearized Convolution Network (LiCo-Net) for keyword spotting.

Keyword Spotting

Biased Self-supervised learning for ASR

no code implementations • 4 Nov 2022 • Florian L. Kreyssig, Yangyang Shi, Jinxi Guo, Leda Sari, Abdelrahman Mohamed, Philip C. Woodland

Furthermore, this paper proposes a variant of MPPT that allows low-footprint streaming models to be trained effectively by computing the MPPT loss on masked and unmasked frames.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

SCA: Streaming Cross-attention Alignment for Echo Cancellation

no code implementations • 1 Nov 2022 • Yang Liu, Yangyang Shi, Yun Li, Kaustubh Kalgaonkar, Sriram Srinivasan, Xin Lei

End-to-End deep learning has shown promising results for speech enhancement tasks, such as noise suppression, dereverberation, and speech separation.

Speech Enhancement Speech Separation

Learning a Dual-Mode Speech Recognition Model via Self-Pruning

no code implementations • 25 Jul 2022 • Chunxi Liu, Yuan Shangguan, Haichuan Yang, Yangyang Shi, Raghuraman Krishnamoorthi, Ozlem Kalinli

There is growing interest in unifying the streaming and full-context automatic speech recognition (ASR) networks into a single end-to-end ASR model to simplify the model training and deployment for both use cases.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Streaming parallel transducer beam search with fast-slow cascaded encoders

no code implementations • 29 Mar 2022 • Jay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas Chandra, Ozlem Kalinli, Michael L Seltzer

Streaming ASR with strict latency constraints is required in many speech recognition applications.

Low-latency processing speech-recognition +1

TorchAudio: Building Blocks for Audio and Speech Processing

2 code implementations • 28 Oct 2021 • Yao-Yuan Yang, Moto Hira, Zhaoheng Ni, Anjali Chourdia, Artyom Astafurov, Caroline Chen, Ching-Feng Yeh, Christian Puhrsch, David Pollack, Dmitriy Genzel, Donny Greenberg, Edward Z. Yang, Jason Lian, Jay Mahadeokar, Jeff Hwang, Ji Chen, Peter Goldsborough, Prabhat Roy, Sean Narenthiran, Shinji Watanabe, Soumith Chintala, Vincent Quenneville-Bélair, Yangyang Shi

This document describes version 0.10 of TorchAudio: building blocks for machine learning applications in the audio and speech processing domain.

BIG-bench Machine Learning valid

2,379

Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study

no code implementations • 7 Oct 2021 • Dawei Liang, Yangyang Shi, Yun Wang, Nayan Singhal, Alex Xiao, Jonathan Shaw, Edison Thomaz, Ozlem Kalinli, Mike Seltzer

Detection of common events and scenes from audio is useful for extracting and understanding human contexts in daily life.

Event Detection

Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution

no code implementations • 7 Oct 2021 • Yangyang Shi, Chunyang Wu, Dilin Wang, Alex Xiao, Jay Mahadeokar, Xiaohui Zhang, Chunxi Liu, Ke Li, Yuan Shangguan, Varun Nagaraja, Ozlem Kalinli, Mike Seltzer

This paper improves the streaming transformer transducer for speech recognition by using non-causal convolution.

speech-recognition Speech Recognition

On lattice-free boosted MMI training of HMM and CTC-based full-context ASR models

no code implementations • 9 Jul 2021 • Xiaohui Zhang, Vimal Manohar, David Zhang, Frank Zhang, Yangyang Shi, Nayan Singhal, Julian Chan, Fuchun Peng, Yatharth Saraf, Mike Seltzer

Hybrid automatic speech recognition (ASR) models are typically sequentially trained with CTC or LF-MMI criteria.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Collaborative Training of Acoustic Encoders for Speech Recognition

no code implementations • 16 Jun 2021 • Varun Nagaraja, Yangyang Shi, Ganesh Venkatesh, Ozlem Kalinli, Michael L. Seltzer, Vikas Chandra

On-device speech recognition requires training models of different sizes for deploying on devices with various computational budgets.

speech-recognition Speech Recognition

Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

no code implementations • 6 Apr 2021 • Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios

no code implementations • 6 Apr 2021 • Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

In order to achieve flexible and better accuracy and latency trade-offs, the following techniques are used.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion

no code implementations • 5 Apr 2021 • Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer

How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area.

Language Modelling speech-recognition +1

Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency

no code implementations • 5 Apr 2021 • Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

DET gets similar accuracy as a baseline model with better latency on a large in-house data set by assigning a lightweight encoder for the beginning part of one utterance and a full-size encoder for the rest.

speech-recognition Speech Recognition

Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

no code implementations • 3 Nov 2020 • Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank Zhang, Julian Chan, Michael L. Seltzer

Attention-based models have been gaining popularity recently for their strong performance demonstrated in fields such as machine translation and automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications

no code implementations • 27 Oct 2020 • Yongqiang Wang, Yangyang Shi, Frank Zhang, Chunyang Wu, Julian Chan, Ching-Feng Yeh, Alex Xiao

We compare the transformer based acoustic models with their LSTM counterparts on industrial scale tasks.

speech-recognition Speech Recognition +1

Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

1 code implementation • 21 Oct 2020 • Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, Mike Seltzer

For a low latency scenario with an average latency of 80 ms, Emformer achieves WER 3.01% on test-clean and 7.09% on test-other.

speech-recognition Speech Recognition

Weak-Attention Suppression For Transformer Based Speech Recognition

no code implementations • 18 May 2020 • Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, Michael L. Seltzer

Transformers, originally proposed for natural language processing (NLP) tasks, have recently achieved great success in automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory

no code implementations • 16 May 2020 • Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang

The memory bank stores the embedding information for all the processed segments.

Knowledge Distillation For Recurrent Neural Network Language Modeling With Trust Regularization

no code implementations • 8 Apr 2019 • Yangyang Shi, Mei-Yuh Hwang, Xin Lei, Haoyu Sheng

Using knowledge distillation with trust regularization, we reduce the parameter size to a third of that of the previously published best model while maintaining the state-of-the-art perplexity result on Penn Treebank data.

Knowledge Distillation Language Modelling +2