Enhancing Suno's Bark Text-to-Speech Model: Addressing Limitations Through Meta's Encodec and Pre-Trained Hubert

Social Science Research Network (SSRN) 2023 · Devin Schumacher, Francis LaBounty Jr.

Bark, a transformer-based text-to-audio model by Suno, generates highly realistic, multilingual speech as well as other audio, including music, background noise, and simple sound effects. While the model has shown promising results, its generative nature can cause the output to deviate from the provided prompts. In this paper, we propose a novel approach to improve the performance of Bark by using Meta's EnCodec to extract audio codebooks and a pre-trained HuBERT model with a linear projection head to generate semantic tokens that better match the source audio. Our method extracts discrete tokens from the audio codebooks using EnCodec and saves them as the fine and coarse prompts. We then use the transcript of the source audio to generate semantic tokens from the original Bark model. However, this process is limited because the wav2vec model and the associated k-means clustering used in the original training are not available. To overcome this, we adapt a pre-trained HuBERT model with a linear projection head, training it to output tokens in the same embedding space as the unavailable model. With these enhancements, we aim to address the limitations of Bark's generative capabilities and produce more accurate text-to-speech output. Our work contributes to the ongoing development of advanced artificial intelligence for text-to-audio generation and supports the research community by providing pre-trained model checkpoints that are ready for inference and available for commercial use.

Keywords: Bark, AI voice cloning, Suno, text-to-speech, artificial intelligence, audio generation, Meta's EnCodec, audio codebooks, semantic tokens, HuBERT, transformer-based model, multilingual speech, wav2vec, linear projection head, embedding space, generative capabilities, pre-trained model checkpoints
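The role of the linear projection head can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors below are random stand-ins for HuBERT frame outputs, the weights are untrained, the sizes `FEATURE_DIM`, `VOCAB_SIZE`, and `NUM_FRAMES` are hypothetical, and picking the token by argmax over the head's logits is an assumption about how discrete tokens would be read off.

```python
import random


def project_to_tokens(features, weights, bias):
    """Map per-frame feature vectors to token ids: linear head, then argmax.

    features: list of frames, each a list of floats (stand-in for HuBERT output)
    weights:  one row of floats per token in the vocabulary
    bias:     one float per token in the vocabulary
    """
    tokens = []
    for frame in features:
        # One logit per vocabulary entry: dot(row, frame) + bias.
        logits = [
            sum(w * x for w, x in zip(row, frame)) + b
            for row, b in zip(weights, bias)
        ]
        # Assumption: the semantic token is the index of the largest logit.
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


# Hypothetical sizes, chosen only to keep the example small.
FEATURE_DIM = 8
VOCAB_SIZE = 16
NUM_FRAMES = 5

rng = random.Random(0)
# Random stand-ins for real HuBERT features and for trained head weights.
features = [[rng.gauss(0, 1) for _ in range(FEATURE_DIM)] for _ in range(NUM_FRAMES)]
weights = [[rng.gauss(0, 1) for _ in range(FEATURE_DIM)] for _ in range(VOCAB_SIZE)]
bias = [0.0] * VOCAB_SIZE

semantic_tokens = project_to_tokens(features, weights, bias)
print(semantic_tokens)  # one token id per frame, each in range(VOCAB_SIZE)
```

In the approach described in the abstract, the head would instead be trained so that its output tokens match the semantic tokens Bark produces from the transcript of the same audio.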
