no code implementations • 24 Feb 2024 • Duo Ma, Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li
In this paper, we investigate a new way to pre-train such a joint speech-text model to learn enhanced speech representations and benefit various speech-related downstream tasks.
no code implementations • 22 Jan 2024 • Xianghu Yue, Xiaohai Tian, Lu Lu, Malu Zhang, Zhizheng Wu, Haizhou Li
To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings, and extracts the most informative audiovisual features of the corresponding text.
no code implementations • 19 Jul 2023 • Jingru Lin, Xianghu Yue, Junyi Ao, Haizhou Li
We train the model based on the idea that different realisations of the same word should be close in the underlying embedding space.
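The training idea in the excerpt above can be illustrated with a small sketch. The excerpt does not name the objective, so the triplet margin loss below is an assumption, as are the function names `cosine_distance` and `triplet_loss` and the margin value. It only shows how "same word, close together" can be turned into a number to minimise.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,  # hypothetical value, not taken from the source
) -> float:
    """Hinge loss that is zero once the positive (another realisation of
    the same word) is closer to the anchor than the negative (a different
    word) by at least `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings: two realisations of one word, and one different word.
same_word_a = [1.0, 0.0]
same_word_b = [1.0, 0.0]
other_word = [0.0, 1.0]

well_separated = triplet_loss(same_word_a, same_word_b, other_word)
badly_separated = triplet_loss(same_word_a, other_word, same_word_b)
```

In this toy case the loss is zero when the two realisations of the same word coincide and the other word is orthogonal, and positive when the roles are swapped.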
no code implementations • 18 Nov 2022 • Xiaoxue Gao, Xianghu Yue, Haizhou Li
Current lyrics transcription approaches rely heavily on supervised learning with labeled data, but such data are scarce and manual labeling of singing is expensive.
no code implementations • 30 Oct 2022 • Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li
Because speech is continuous while text is discrete, we first discretize speech into a sequence of discrete speech tokens to resolve the modality mismatch.
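The discretization step described above can be sketched as nearest-centroid quantization. The excerpt does not say how the tokens are obtained, so the fixed codebook (for example, centroids from k-means over frame features) and the names `squared_distance` and `discretize` are assumptions made for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def discretize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each continuous frame vector to the index of its nearest
    codebook entry, giving one discrete token per frame."""
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, centroid) for centroid in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical 2-D codebook and frame features, for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [3.8, 4.1], [0.2, 0.0]]

speech_tokens = discretize(frames, codebook)
```

Each frame becomes an integer index into the codebook, so the resulting sequence can be handled like a sequence of text tokens.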
no code implementations • 27 Sep 2019 • Xianghu Yue, Grandee Lee, Emre Yilmaz, Fang Deng, Haizhou Li
In this work, we describe an end-to-end (E2E) ASR pipeline for recognizing code-switching (CS) speech in which a low-resourced language is mixed with a high-resourced language.
Automatic Speech Recognition (ASR)
no code implementations • 18 Jun 2019 • Emre Yilmaz, Samuel Cohen, Xianghu Yue, David van Leeuwen, Haizhou Li
This archive contains recordings with monolingual Frisian and Dutch speech segments as well as Frisian-Dutch CS speech; recognition performance on monolingual segments is therefore also vital for accurate transcriptions.
Automatic Speech Recognition (ASR)