Specifically, all these models apply loss masking to the input speech tokens for the ASR task, which means that they do not explicitly model the dependency among the speech tokens.
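As a hedged sketch of what such loss masking looks like in practice (the helper name and shapes are illustrative, not taken from any of the listed papers): positions covered by the mask, here the input speech tokens, contribute nothing to the training objective, so the model is never asked to predict them.

```python
import numpy as np

def masked_ce_loss(logits, targets, mask):
    """Cross-entropy averaged only over unmasked positions.

    logits:  (T, V) unnormalized scores over a vocabulary of size V
    targets: (T,)   integer target token ids
    mask:    (T,)   1.0 where the loss applies, 0.0 where it is masked
                    (e.g. over the input speech tokens in an ASR task)
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    # masked positions are zeroed out before averaging
    return (nll * mask).sum() / mask.sum()
```

Because masked positions are zeroed, changing the model's predictions at those positions leaves the loss unchanged, which is exactly why the token-to-token dependencies there go unmodeled.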
no code implementations • 7 Oct 2023 • JiaMing Wang, Zhihao Du, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang
In this paper, we propose LauraGPT, a unified GPT model for audio recognition, understanding, and generation.
Speaker diarization has gained considerable attention within the speech processing research community.
We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis.
It assigns the representations of augmented views of utterances to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views.
Transformer-based pre-trained language models, such as BERT, achieve great success in various natural language understanding tasks.
Disentangling uncorrelated information in speech utterances is a crucial research topic within the speech community.
In this paper, we propose methods to extract speaker-related information from semantic content in multi-party meetings, which, as we will show, can further benefit speaker diarization.
This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve performance.
Prior studies diagnose the anisotropy problem in sentence representations from pre-trained language models, e.g., BERT, without fine-tuning.
Recently, binaural audio synthesis (BAS) has emerged as a promising research field for its applications in augmented and virtual realities.
A range of experiments conducted on the VoxCeleb datasets demonstrate the superiority of the regularized DINO framework in speaker verification.
While deep speaker embeddings have achieved promising performance for speaker verification, this advantage diminishes in the presence of speaking-style variability.
In this paper, we propose to view clustering-based diarization as a community detection problem.
In this work, we present a GCN-based approach for semi-supervised learning.
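To make the GCN ingredient concrete, here is a minimal numpy sketch of a single graph-convolution step in the standard Kipf–Welling form, H = ReLU(D^(-1/2)(A+I)D^(-1/2) X W); the function name and shapes are illustrative and not taken from the paper above.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution step (Kipf & Welling normalization).

    A: (N, N) adjacency matrix of the graph (no self-loops)
    X: (N, F) node features
    W: (F, F') learnable weight matrix
    Returns ReLU(D^{-1/2} (A + I) D^{-1/2} X W).
    """
    A_hat = A + np.eye(len(A))                 # add self-loops
    d = A_hat.sum(axis=1)                      # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # symmetric normalization
    H = D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W
    return np.maximum(H, 0.0)                  # ReLU
```

In the semi-supervised setting, a few labeled nodes supervise the output while the normalized adjacency propagates information to the unlabeled ones.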
Through this formulation, we propose the speaker embedding-aware neural diarization (SEND) framework, where a speech encoder, a speaker encoder, two similarity scorers, and a post-processing network are jointly optimized to predict the encoded labels according to the similarities between speech features and speaker embeddings.
Ranked #1 on Speaker Diarization on AliMeeting
In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with the power set.
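A minimal sketch of the power-set encoding idea (function names are illustrative): each subset of active speakers at a frame maps to one class index, e.g. via a bitmask, turning multi-label speaker activity into a single-label classification target.

```python
def encode_speakers(active, num_speakers):
    """Map a set of active speaker ids to one power-set class index.

    With num_speakers speakers there are 2**num_speakers classes,
    one per subset (including silence, the empty set -> index 0).
    """
    label = 0
    for s in active:
        label |= 1 << s          # set the bit for each active speaker
    return label

def decode_speakers(label, num_speakers):
    """Inverse mapping: class index back to the list of active speakers."""
    return [s for s in range(num_speakers) if (label >> s) & 1]
```

For example, with four speakers, frames where speakers 0 and 2 overlap get the single class index 5, so a plain softmax classifier can handle overlapped speech.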
We propose a novel Pooling Network (PoNet) for token mixing in long sequences with linear complexity.
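As a hedged illustration of pooling-based token mixing (a toy sketch, not the actual PoNet architecture): instead of pairwise attention, which costs O(N^2) in sequence length, each token is mixed with a pooled sequence-level summary, which costs O(N).

```python
import numpy as np

def pooling_token_mix(x):
    """Mix tokens via global average pooling in O(N) time.

    x: (N, d) token embeddings for a sequence of length N.
    Each token receives the same pooled summary of the whole
    sequence, broadcast back and added residually.
    """
    g = x.mean(axis=0, keepdims=True)   # global pooled summary, (1, d)
    return x + g                         # broadcast to every token
```

Real pooling networks layer several pooling granularities (global, segment, local) rather than a single mean, but the linear-complexity principle is the same.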
We propose BeamTransformer, an efficient architecture to leverage beamformer's edge in spatial filtering and transformer's capability in context sequence modeling.
COVID-19, as a global health crisis, has triggered fear of unprecedented intensity.
In this paper, we describe a speaker diarization system that enables localization and identification of all speakers present in a conversation or meeting.
Applications of the framework with Chinese data reveal highly heterogeneous health benefits of reducing fossil fuel use in different sectors and regions in China, with a mean of $34/tCO2 and a standard deviation of $84/tCO2.