Autoregressive Transformers are strong language models but incur O(T) complexity per generated token, since the self-attention mechanism must attend over all T preceding tokens.
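As an illustration of that cost, the following minimal NumPy sketch (dimensions and names are hypothetical, not from any specific model) shows one autoregressive decoding step with a key/value cache: producing the next token requires T dot products against the cached keys, so each step is O(T) and generating T tokens is O(T^2) overall.

```python
import numpy as np

d = 8    # head dimension (illustrative)
T = 16   # number of tokens generated so far

rng = np.random.default_rng(0)
K = rng.standard_normal((T, d))  # cached keys, one row per past token
V = rng.standard_normal((T, d))  # cached values, one row per past token
q = rng.standard_normal(d)       # query for the current token

scores = K @ q / np.sqrt(d)      # T scaled dot products -> O(T) work
weights = np.exp(scores - scores.max())
weights /= weights.sum()         # softmax over the T past positions
out = weights @ V                # O(T * d) weighted sum of values

assert out.shape == (d,)
```

Doubling the number of cached tokens doubles the work in this step, which is exactly the per-token O(T) behavior described above.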
It involves first pre-training a model on a large amount of unlabeled data, then adapting the model to target tasks of interest.
Automatic speech recognition (ASR) and speaker diarization (SD) models have traditionally been trained separately to produce rich conversation transcripts with speaker labels.
Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties.
Stories generated with neural language models have shown promise in grammatical and stylistic consistency.
We are interested in the task of generating multi-instrumental music scores.
Existing research on music generation focuses on composition but often ignores the expressive performance characteristics required for plausible renditions of the resulting pieces.
Recent advances in deep neural networks have enabled algorithms to compose music that is comparable to music composed by humans.