no code implementations • IWSLT (ACL) 2022 • Ziqiang Zhang, Junyi Ao
This paper describes the submission of our end-to-end YiTrans speech translation system for the IWSLT 2022 offline task, which translates English audio into German, Chinese, and Japanese.
no code implementations • 3 Jul 2024 • Jingru Lin, Meng Ge, Junyi Ao, Liqun Deng, Haizhou Li
Specifically, SA-WavLM follows an "extract-merge-predict" pipeline in which the representations of each speaker in the input mixture are first extracted individually and then merged before the final prediction.
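The excerpt gives no reference implementation, so here is a minimal, hedged PyTorch sketch of an "extract-merge-predict" pipeline; the module shapes, the speaker-embedding conditioning, and the mean-pool merge are all simplifying assumptions, not SA-WavLM's actual architecture.

```python
import torch
import torch.nn as nn

class ExtractMergePredict(nn.Module):
    """Hypothetical sketch: extract a representation per speaker from a
    mixture, merge them, then make one final prediction."""
    def __init__(self, feat_dim=768, spk_dim=192, num_classes=100):
        super().__init__()
        # Shared extractor, conditioned on a speaker embedding (assumption).
        self.extractor = nn.GRU(feat_dim + spk_dim, feat_dim, batch_first=True)
        self.predictor = nn.Linear(feat_dim, num_classes)

    def forward(self, mixture_feats, speaker_embs):
        # mixture_feats: (B, T, feat_dim); speaker_embs: (B, S, spk_dim)
        per_speaker = []
        for s in range(speaker_embs.size(1)):
            emb = speaker_embs[:, s].unsqueeze(1).expand(-1, mixture_feats.size(1), -1)
            out, _ = self.extractor(torch.cat([mixture_feats, emb], dim=-1))
            per_speaker.append(out)                       # "extract"
        merged = torch.stack(per_speaker, dim=1).mean(1)  # "merge"
        return self.predictor(merged)                     # "predict"

model = ExtractMergePredict()
logits = model(torch.randn(2, 50, 768), torch.randn(2, 2, 192))
print(logits.shape)  # torch.Size([2, 50, 100])
```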
1 code implementation • 19 Jun 2024 • Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu
We also conduct a comprehensive evaluation of the generated responses using objective evaluation methods (e.g., BLEU and ROUGE), subjective evaluations, and LLM-based metrics.
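The excerpt does not name the evaluation toolchain; a common way to compute the objective metrics it mentions uses the sacrebleu and rouge_score packages (the example strings below are hypothetical):

```python
import sacrebleu
from rouge_score import rouge_scorer

hyps = ["the model answered the spoken question correctly"]
refs = [["the model answers the spoken question correctly"]]  # one reference stream

# Corpus-level BLEU; refs is a list of reference streams aligned with hyps.
bleu = sacrebleu.corpus_bleu(hyps, refs)
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-1 and ROUGE-L F-measures for one hypothesis/reference pair.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(refs[0][0], hyps[0])
print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```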
no code implementations • 24 Feb 2024 • Duo Ma, Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li
In this paper, we investigate a new way to pre-train such a joint speech-text model to learn enhanced speech representations and benefit various speech-related downstream tasks.
no code implementations • 26 Dec 2023 • Meng Ge, Yizhou Peng, Yidi Jiang, Jingru Lin, Junyi Ao, Mehmet Sinan Yildirim, Shuai Wang, Haizhou Li, Mengling Feng
This paper summarizes our team's efforts in both tracks of the ICMC-ASR Challenge for in-car multi-channel automatic speech recognition.
no code implementations • 19 Jul 2023 • Jingru Lin, Xianghu Yue, Junyi Ao, Haizhou Li
We train the model based on the idea that different realisations of the same word should be close in the underlying embedding space.
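As a hedged illustration of that idea (not necessarily the paper's actual objective), an InfoNCE-style contrastive loss pulls embeddings of two realisations of the same word together while pushing other words apart:

```python
import torch
import torch.nn.functional as F

def word_contrastive_loss(anchor, positive, temperature=0.1):
    """anchor[i] and positive[i] are embeddings of two realisations of the
    same word; other rows in the batch act as negatives. Shapes: (B, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(anchor.size(0))        # the diagonal is the positive
    return F.cross_entropy(logits, targets)

loss = word_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```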
no code implementations • 30 Oct 2022 • Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li
Firstly, due to the distinct characteristics of the speech and text modalities, where speech is continuous while text is discrete, we discretize speech into a sequence of discrete speech tokens to address the modality mismatch problem.
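The excerpt does not say how the discretization is performed; a widely used recipe, assumed here for illustration, is k-means clustering over frame-level self-supervised features, mapping each frame to a cluster ID:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for frame-level SSL features, e.g. HuBERT outputs: (frames, dim).
features = np.random.randn(10_000, 768).astype(np.float32)

# Learn a codebook of 500 units (a common choice; the paper may differ).
kmeans = MiniBatchKMeans(n_clusters=500, batch_size=1024)
kmeans.fit(features)

# An utterance then becomes a sequence of discrete speech tokens.
utterance = np.random.randn(120, 768).astype(np.float32)
tokens = kmeans.predict(utterance)
print(tokens[:10])  # e.g. [417 23 23 305 ...]
```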
1 code implementation • 7 Oct 2022 • Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, LiRong Dai, Jinyu Li, Furu Wei
The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods.
Automatic Speech Recognition (ASR) +2
1 code implementation • 12 Jun 2022 • Ziqiang Zhang, Junyi Ao, Long Zhou, Shujie Liu, Furu Wei, Jinyu Li
The YiTrans system is built on large-scale pre-trained encoder-decoder models.
1 code implementation • 31 Mar 2022 • Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, LiRong Dai, Jinyu Li, Yao Qian, Furu Wei
In this way, the decoder learns to reconstruct original speech information with codes before learning to generate correct text (a minimal sketch follows below).
Automatic Speech Recognition (ASR) +5
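A minimal sketch of the idea in the entry above, under assumed shapes and hypothetical module names: a shared decoder with two heads, one trained first to reconstruct discrete speech codes and one to emit text tokens.

```python
import torch
import torch.nn as nn

class TwoHeadDecoder(nn.Module):
    """Shared Transformer decoder with a code head (pre-training target)
    and a text head (fine-tuning target)."""
    def __init__(self, d_model=512, n_codes=500, n_text_tokens=10_000):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.code_head = nn.Linear(d_model, n_codes)        # reconstruct speech codes
        self.text_head = nn.Linear(d_model, n_text_tokens)  # generate text

    def forward(self, tgt_embeds, encoder_out, stage="codes"):
        hidden = self.decoder(tgt_embeds, encoder_out)
        head = self.code_head if stage == "codes" else self.text_head
        return head(hidden)

dec = TwoHeadDecoder()
enc_out = torch.randn(2, 100, 512)  # encoder states for a speech input
tgt = torch.randn(2, 40, 512)       # embedded decoder inputs
print(dec(tgt, enc_out, stage="codes").shape)  # torch.Size([2, 40, 500])
```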
1 code implementation • 29 Mar 2022 • Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li
LightHuBERT outperforms the original HuBERT on ASR and five SUPERB tasks at the same model size, achieves performance comparable to the teacher model on most tasks with 29% fewer parameters, and obtains a $3.5\times$ compression ratio on three SUPERB tasks (automatic speaker verification, keyword spotting, and intent classification) with only a slight accuracy loss.
Automatic Speech Recognition (ASR) +7
4 code implementations • ACL 2022 • Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores encoder-decoder pre-training for self-supervised speech/text representation learning (a usage example follows below).
Automatic Speech Recognition (ASR) +8
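SpeechT5 checkpoints are available through Hugging Face transformers; a short text-to-speech example is below. The zero speaker embedding is a placeholder for illustration; a real 512-dimensional x-vector should be used.

```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="SpeechT5 treats speech and text in one framework.",
                   return_tensors="pt")
speaker_embeddings = torch.zeros(1, 512)  # placeholder; use a real x-vector
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings,
                               vocoder=vocoder)
print(speech.shape)  # 1-D waveform tensor at 16 kHz
```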
no code implementations • 11 Oct 2021 • Rui Wang, Junyi Ao, Long Zhou, Shujie Liu, Zhihua Wei, Tom Ko, Qing Li, Yu Zhang
In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants with or without the proposed attention mechanism for speaker recognition.
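The exact formulation is in the paper; as a rough illustrative sketch (all details assumed, not the authors' method), one way to give attention heads different "views" is to restrict some heads to a local window while the rest attend globally:

```python
import torch
import torch.nn.functional as F

def multi_view_attention(q, k, v, n_local_heads=4, window=16):
    """q, k, v: (B, H, T, D). The first n_local_heads heads see only a
    local window around each position (one 'view'); the remaining heads
    attend globally (another 'view'). Purely illustrative."""
    B, H, T, D = q.shape
    scores = q @ k.transpose(-2, -1) / D ** 0.5  # (B, H, T, T)
    idx = torch.arange(T)
    local_mask = (idx[None, :] - idx[:, None]).abs() > window  # (T, T)
    scores[:, :n_local_heads] = scores[:, :n_local_heads].masked_fill(
        local_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v  # (B, H, T, D)

out = multi_view_attention(*(torch.randn(2, 8, 50, 64) for _ in range(3)))
print(out.shape)  # torch.Size([2, 8, 50, 64])
```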