Moreover, we improve the audio language model framework by using interleaved audio-text embeddings as the input sequence.
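As a rough illustration of what an interleaved audio-text input sequence could look like, the sketch below concatenates per-segment embeddings in their original order and keeps a parallel list of modality tags. The function name `interleave_segments`, the segment layout, and the toy 2-dimensional vectors are assumptions made for illustration only. The source does not specify how the embeddings are produced or ordered.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def interleave_segments(
    segments: Sequence[Tuple[str, List[Vector]]],
) -> Tuple[List[Vector], List[str]]:
    """Flatten alternating audio/text segments into one input sequence.

    Hypothetical helper: each segment is (modality, list of embedding
    vectors). All vectors are assumed to share one embedding dimension,
    e.g. after projection into a common space.
    """
    sequence: List[Vector] = []
    modalities: List[str] = []
    dim = None
    for modality, embeddings in segments:
        if modality not in ("audio", "text"):
            raise ValueError(f"unknown modality: {modality!r}")
        for vec in embeddings:
            if dim is None:
                dim = len(vec)  # first vector fixes the expected dimension
            elif len(vec) != dim:
                raise ValueError("all embeddings must share one dimension")
            sequence.append(list(vec))
            modalities.append(modality)
    return sequence, modalities


# Toy example: text and audio segments appear in their original order.
segments = [
    ("text", [[0.1, 0.2], [0.3, 0.4]]),
    ("audio", [[1.0, 1.1]]),
    ("text", [[0.5, 0.6]]),
    ("audio", [[2.0, 2.1], [2.2, 2.3]]),
]
sequence, modalities = interleave_segments(segments)
```

Keeping the modality tags alongside the flat sequence is one possible design choice: it would let a downstream model add modality-specific position or type embeddings, though whether the original framework does so is not stated.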
Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text.
We presented the Treff adapter, a training-efficient adapter for CLAP, to boost zero-shot classification performance by making use of a small set of labelled data.
Deep neural networks have recently achieved breakthroughs in sound generation with text prompts.
For channel compression, a novel convolutional construction named compact convolution is proposed, drawing on progress in spatial convolution, channel grouping, and pooling operations.