no code implementations • 7 Oct 2023 • JiaMing Wang, Zhihao Du, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang
In this paper, we propose LauraGPT, a unified GPT model for audio recognition, understanding, and generation.
We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis.
In addition, a two-pass decoding strategy is proposed to fully leverage the contextual modeling ability, resulting in better recognition performance.
FunASR offers models trained on large-scale industrial corpora and the ability to deploy them in applications.
Ranked #1 on Speech Recognition on WenetSpeech (using extra training data)
Recently, end-to-end neural diarization (EEND) has been introduced and has achieved promising results in speaker-overlapped scenarios.
Ranked #1 on Speaker Diarization on CALLHOME
Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR tasks.
Therefore, we propose the second approach, WD-SOT, which addresses alignment errors by introducing a word-level diarization model and thereby removes the dependency on timestamp alignment.
Through this formulation, we propose the speaker embedding-aware neural diarization (SEND) framework, where a speech encoder, a speaker encoder, two similarity scorers, and a post-processing network are jointly optimized to predict the encoded labels according to the similarities between speech features and speaker embeddings.
Ranked #1 on Speaker Diarization on AliMeeting
In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with power set.
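The power-set encoding mentioned above can be illustrated with a minimal sketch. It assumes each frame carries one binary activity flag per speaker and that the single label is the integer whose bits are those flags. The function names `powerset_encode` and `powerset_decode` are hypothetical, and the paper's actual encoding (for example, any cap on the number of simultaneously active speakers) may differ.

```python
from typing import List


def powerset_encode(activity: List[int]) -> int:
    """Map one frame of per-speaker binary activities to a single class index.

    Assumption: speaker i contributes bit i, so the label is sum(a_i * 2**i).
    """
    label = 0
    for speaker_idx, active in enumerate(activity):
        if active not in (0, 1):
            raise ValueError("each activity flag must be 0 or 1")
        label |= active << speaker_idx
    return label


def powerset_decode(label: int, num_speakers: int) -> List[int]:
    """Recover the per-speaker binary activities from a class index."""
    if not 0 <= label < (1 << num_speakers):
        raise ValueError("label is out of range for this number of speakers")
    return [(label >> i) & 1 for i in range(num_speakers)]


# Four frames, three speakers: silence, speaker 0 alone,
# speakers 0 and 1 overlapping, speaker 2 alone.
frames = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
encoded = [powerset_encode(f) for f in frames]
decoded = [powerset_decode(label, 3) for label in encoded]
print(encoded)  # [0, 1, 3, 4]
```

With N speakers this yields 2^N possible classes, so overlapping speech becomes an ordinary single-label classification target at each frame instead of N independent binary decisions.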
To improve robustness, a speech enhancement front-end is used.
In this paper, we propose a new strategy for acoustic scene classification (ASC), namely recognizing acoustic scenes by identifying distinct sound events.