Search Results for author: Yao Qian

Found 30 papers, 11 papers with code

Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages

no code implementations • 11 Nov 2024 • Midia Yousefi, Yao Qian, Junkun Chen, Gang Wang, Yanqing Liu, Dongmei Wang, Xiaofei Wang, Jian Xue

End-to-end speech translation (ST), which translates source language speech directly into target language text, has garnered significant attention in recent years.

Decoder, Machine Translation, +2

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

no code implementations • 8 Jun 2024 • Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei

This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time.

Speech Synthesis, Text to Speech, +1

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

1 code implementation • 28 May 2024 • Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Sheng Zhao, Michael Zeng

There is rising research interest in directly translating speech from one language to another, known as end-to-end speech-to-speech translation.

Machine Translation, speech-recognition, +4

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

no code implementations • 10 Apr 2024 • Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation.

Dialogue Generation, Text to Speech

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

1 code implementation • 17 Jul 2023 • Shaoshi Ling, Yuxuan Hu, Shuangbei Qian, Guoli Ye, Yao Qian, Yifan Gong, Ed Lin, Michael Zeng

Most end-to-end (E2E) speech recognition models are composed of encoder and decoder blocks that perform acoustic and language modeling functions.

Decoder, Language Modelling, +3

Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

no code implementations • 30 May 2023 • Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng

State-of-the-art large-scale universal speech models (USMs) show decent automatic speech recognition (ASR) performance across multiple domains and languages.

Automatic Speech Recognition, Automatic Speech Recognition (ASR), +1

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

1 code implementation • NeurIPS 2023 • Chenyang Le, Yao Qian, Long Zhou, Shujie Liu, Yanmin Qian, Michael Zeng, Xuedong Huang

Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language.

Language Modelling, Multi-Task Learning, +2

i-Code Studio: A Configurable and Composable Framework for Integrative AI

no code implementations • 23 May 2023 • Yuwei Fang, Mahmoud Khademi, Chenguang Zhu, ZiYi Yang, Reid Pryzant, Yichong Xu, Yao Qian, Takuya Yoshioka, Lu Yuan, Michael Zeng, Xuedong Huang

Artificial General Intelligence (AGI) requires comprehensive understanding and generation capabilities for a variety of tasks spanning different modalities and functionalities.

Question Answering, Speech-to-Speech Translation, +3

i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

no code implementations • 21 May 2023 • ZiYi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu, Dongdong Chen, Yao Qian, Mei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel Codella, Bin Xiao, Yu Shi, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang

The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models, which lack generative abilities.

Decoder, Diversity

Target Sound Extraction with Variable Cross-modality Clues

1 code implementation • 15 Mar 2023 • Chenda Li, Yao Qian, Zhuo Chen, Dongmei Wang, Takuya Yoshioka, Shujie Liu, Yanmin Qian, Michael Zeng

Automatic target sound extraction (TSE) is a machine learning approach to mimic the human auditory perception capability of attending to a sound source of interest from a mixture of sources.

AudioCaps, Target Sound Extraction

Self-Supervised Learning for speech recognition with Intermediate layer supervision

1 code implementation • 16 Dec 2021 • Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, Zhenglu Yang

Recently, pioneering work has found that speech pre-trained models can solve full-stack speech processing tasks, because such models use their bottom layers to learn speaker-related information and their top layers to encode content-related information; a minimal sketch of supervising intermediate layers follows below.

Language Modelling, Self-Supervised Learning, +2
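The sketch below illustrates the general idea of intermediate-layer supervision for a masked-prediction self-supervised objective: the same loss is applied at an intermediate encoder layer and at the top layer. It is a minimal, hypothetical sketch, not the authors' code; the layer indices, loss weights, and shared prediction head are assumptions.

```python
# Minimal sketch, not the authors' implementation: apply the same
# masked-prediction SSL loss at an intermediate layer and at the top layer.
# Layer indices, weights, and the shared prediction head are assumptions.
import torch
import torch.nn as nn

class TinySpeechEncoder(nn.Module):
    def __init__(self, dim=256, n_layers=12, n_targets=320):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(dim, n_targets)  # predicts discrete targets (e.g. quantized units)

    def forward(self, x):
        states = []
        for layer in self.layers:
            x = layer(x)
            states.append(x)
        return states  # hidden states of every layer

def intermediate_layer_loss(model, feats, targets, mask,
                            supervised_layers=(5, 11), weights=(1.0, 1.0)):
    """Sum the masked-prediction loss over the chosen layers.

    feats:   (B, T, dim) input features
    targets: (B, T) discrete target ids
    mask:    (B, T) boolean mask of masked frames
    """
    states = model(feats)
    ce = nn.CrossEntropyLoss()
    loss = feats.new_zeros(())
    for w, idx in zip(weights, supervised_layers):
        logits = model.head(states[idx][mask])        # predictions at masked frames only
        loss = loss + w * ce(logits, targets[mask])   # same targets at every supervised layer
    return loss
```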

Multilingual Speech Recognition using Knowledge Transfer across Learning Processes

no code implementations • 15 Oct 2021 • Rimita Lahiri, Kenichi Kumatani, Eric Sun, Yao Qian

Multilingual end-to-end (E2E) models have shown great potential for expanding language coverage in automatic speech recognition (ASR).

Automatic Speech Recognition, Automatic Speech Recognition (ASR), +4

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

6 code implementations • ACL 2022 • Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores encoder-decoder pre-training for self-supervised speech/text representation learning; a minimal usage sketch follows below.

Automatic Speech Recognition, Automatic Speech Recognition (ASR), +8
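As a usage note, the sketch below drives the Hugging Face port of SpeechT5 for text-to-speech. The checkpoint names and the all-zero speaker embedding are illustrative assumptions; in practice a real 512-dimensional speaker x-vector would be supplied.

```python
# Hedged usage sketch of the Hugging Face port of SpeechT5 for TTS.
# Checkpoint names and the all-zero speaker embedding are illustrative
# assumptions; a real 512-dim speaker x-vector should be used in practice.
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="SpeechT5 shares one encoder-decoder backbone for speech and text.",
                   return_tensors="pt")
speaker_embedding = torch.zeros((1, 512))  # placeholder for a speaker x-vector
waveform = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
print(waveform.shape)  # 1-D tensor of 16 kHz audio samples
```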

Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition

no code implementations • 11 Oct 2021 • Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, Yu Wu

In this paper, we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning; a schematic sketch of the clean/noisy pairing idea follows below.

Automatic Speech Recognition, Automatic Speech Recognition (ASR), +7
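The sketch below is a generic stand-in for the clean/noisy pairing idea: an InfoNCE-style term that ties the representation of an utterance to that of its noise-augmented copy. It is not the exact wav2vec-Switch objective, which swaps quantized targets between the two views inside wav2vec 2.0's contrastive loss.

```python
# Generic illustration, not the exact wav2vec-Switch recipe: an InfoNCE-style
# term that pulls the representation of an utterance towards that of its
# noise-augmented copy, using other utterances in the batch as negatives.
import torch
import torch.nn.functional as F

def clean_noisy_contrastive(z_clean, z_noisy, temperature=0.1):
    """z_clean, z_noisy: (batch, dim) pooled encodings of paired clean/noisy views."""
    z_clean = F.normalize(z_clean, dim=-1)
    z_noisy = F.normalize(z_noisy, dim=-1)
    logits = z_clean @ z_noisy.t() / temperature           # clean-vs-noisy similarity matrix
    labels = torch.arange(z_clean.size(0), device=logits.device)
    # Each clean view should match its own noisy view, and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```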

UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset

no code implementations • 12 Jul 2021 • Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Yao Qian, Kenichi Kumatani, Furu Wei

Recently, there has been great interest in self-supervised learning (SSL), where a model is pre-trained on large-scale unlabeled data and then fine-tuned on a small labeled dataset.

Self-Supervised Learning, speech-recognition, +1

Speech-language Pre-training for End-to-end Spoken Language Understanding

no code implementations • 11 Feb 2021 • Yao Qian, Ximo Bian, Yu Shi, Naoyuki Kanda, Leo Shen, Zhen Xiao, Michael Zeng

End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from the speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module.

Ranked #3 on Spoken Language Understanding on Fluent Speech Commands (using extra training data)

Decoder, Language Modelling, +2

UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

5 code implementations • 19 Jan 2021 • Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang

In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner; a minimal multi-task sketch follows below.

Multi-Task Learning, Representation Learning, +4
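Below is a minimal sketch of the multi-task idea named in the abstract, assuming a simple weighted sum of a supervised CTC term on labeled speech and a precomputed contrastive self-supervised term on unlabeled speech. The weighting and interfaces are illustrative, not UniSpeech's exact formulation.

```python
# Minimal multi-task sketch: a supervised phonetic CTC loss on labeled speech
# combined with a contrastive self-supervised loss on unlabeled speech.
# The weighting and the contrastive term are illustrative placeholders,
# not UniSpeech's exact formulation.
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def unispeech_style_loss(log_probs, phone_targets, input_lens, target_lens,
                         contrastive_term, alpha=0.5):
    """
    log_probs:        (T, B, vocab) log-probabilities from the encoder head
    phone_targets:    (B, S) phonetic label sequences for the labeled batch
    input_lens:       (B,) frame lengths; target_lens: (B,) label lengths
    contrastive_term: precomputed self-supervised contrastive loss (scalar tensor)
    """
    supervised = ctc_loss(log_probs, phone_targets, input_lens, target_lens)
    return alpha * supervised + (1.0 - alpha) * contrastive_term
```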

A Unified Tagging Solution: Bidirectional LSTM Recurrent Neural Network with Word Embedding

no code implementations • 1 Nov 2015 • Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao

Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for modeling and predicting sequential data, e.g. speech utterances or handwritten documents.

Chunking, Feature Engineering, +4

Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network

4 code implementations • 21 Oct 2015 • Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao

Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e.g. speech utterances or handwritten documents; a minimal tagger sketch follows below.

Part-Of-Speech Tagging, POS, +1
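The sketch below is a minimal bidirectional-LSTM tagger over word embeddings, illustrating the architecture these two entries describe. Layer sizes, vocabulary handling, and the toy usage are assumptions rather than the papers' configuration.

```python
# Minimal BLSTM tagger over word embeddings; sizes and the toy batch are
# illustrative assumptions, not the papers' exact setup.
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    def __init__(self, vocab_size, n_tags, emb_dim=100, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.blstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)  # forward + backward states per token

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        h, _ = self.blstm(self.embed(token_ids))
        return self.out(h)                    # (batch, seq_len, n_tags) tag scores

# Usage sketch: score POS tags for a toy batch of word indices.
tagger = BLSTMTagger(vocab_size=10000, n_tags=45)
scores = tagger(torch.randint(0, 10000, (2, 7)))
print(scores.shape)  # torch.Size([2, 7, 45])
```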
