Enhancing Suno's Bark Text-to-Speech Model: Addressing Limitations Through Meta's Encodec and Pre-Trained Hubert

Bark, a transformer-based text-to-audio model by Suno, generates highly realistic, multilingual speech as well as other audio, including music, background noise, and simple sound effects. While the model has shown promising results, its generative nature can cause the output to deviate from the provided prompts. In this paper, we propose a novel approach to improve the performance of Bark by using Meta's EnCodec to extract audio codebooks and a pre-trained HuBERT model with a linear projection head to generate semantic tokens that better match the source audio. Our method extracts discrete tokens from the audio codebooks using EnCodec and saves them as the fine and coarse prompts. We then use the transcript of the source audio to generate semantic tokens with the original Bark model. This step is limited, however, because the wav2vec model and the associated k-means clustering used in the original training are not available. To overcome this, we adapt a pre-trained HuBERT model with a linear projection head, training it to output tokens in the same embedding space as the unavailable model. With these enhancements, we aim to address the limitations of Bark's generative capabilities and produce more accurate text-to-speech output. Our work contributes to the ongoing development of text-to-audio generation and supports the research community by providing pre-trained model checkpoints that are ready for inference and available for commercial use.

Keywords: Bark, AI voice cloning, Suno, text-to-speech, artificial intelligence, audio generation, Meta's EnCodec, audio codebooks, semantic tokens, HuBERT, transformer-based model, multilingual speech, wav2vec, linear projection head, embedding space, generative capabilities, pre-trained model checkpoints
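The projection-head idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `LinearProjectionHead`, the toy dimensions, the learning rate, and the synthetic "teacher" that stands in for the unavailable tokenizer are all assumptions made for illustration. Random vectors replace real HuBERT frame features, and a single linear layer is trained with cross-entropy to reproduce the target token ids, one per frame.

```python
import numpy as np


def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


class LinearProjectionHead:
    """Hypothetical head mapping frame features to semantic-token logits."""

    def __init__(self, feature_dim, vocab_size, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.normal(0.0, 0.01, size=(feature_dim, vocab_size))
        self.bias = np.zeros(vocab_size)

    def logits(self, features):
        return features @ self.weight + self.bias

    def predict_tokens(self, features):
        # One discrete token id per frame.
        return self.logits(features).argmax(axis=-1)

    def train_step(self, features, target_tokens, lr=0.5):
        # One full-batch gradient-descent step on the cross-entropy loss.
        n = features.shape[0]
        rows = np.arange(n)
        probs = softmax(self.logits(features))
        loss = -np.log(probs[rows, target_tokens] + 1e-12).mean()
        grad_logits = probs.copy()
        grad_logits[rows, target_tokens] -= 1.0
        grad_logits /= n
        self.weight -= lr * (features.T @ grad_logits)
        self.bias -= lr * grad_logits.sum(axis=0)
        return loss


# Toy stand-ins (assumptions): random "frame features" instead of HuBERT
# outputs, and a hidden linear teacher instead of the original tokenizer.
rng = np.random.default_rng(1)
num_frames, feature_dim, vocab_size = 200, 16, 8
features = rng.normal(size=(num_frames, feature_dim))
teacher = rng.normal(size=(feature_dim, vocab_size))
target_tokens = (features @ teacher).argmax(axis=-1)

head = LinearProjectionHead(feature_dim, vocab_size)
losses = [head.train_step(features, target_tokens) for _ in range(200)]
predicted = head.predict_tokens(features)
accuracy = (predicted == target_tokens).mean()
```

In a real setup the features would presumably come from a frozen pre-trained HuBERT encoder and the targets from semantic tokens that Bark generates for the transcript, with much larger feature and vocabulary sizes than the toy values used here.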
