This paper proposes a new self-attention-based model for music score infilling, i.e., to generate a polyphonic music sequence that fills in the gap between given past and future contexts.
In this work, we present DadaGP, a new symbolic music dataset comprising 26,181 song scores in the GuitarPro format covering 739 musical genres, along with an accompanying tokenized format well-suited for generative sequence models such as the Transformer.
This paper presents an attempt to employ the masked language modeling approach of BERT to pre-train a 12-layer Transformer model on 4,166 polyphonic piano MIDI files for tackling a number of symbolic-domain discriminative music understanding tasks.
This paper presents a novel system architecture that integrates blind source separation with joint beat and downbeat tracking in musical audio signals.
Due to advances in deep learning, the performance of automatic beat and downbeat tracking in musical audio signals has seen great improvement in recent years.
Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity.
Transformers and variational autoencoders (VAEs) have been extensively employed for symbolic (e.g., MIDI) domain music generation.
In this paper, we present a conceptually different approach that explicitly takes into account the type of the tokens, such as note types and metric types.
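As a rough illustration of what type-aware token handling can look like, here is a minimal sketch assuming PyTorch; the type set, vocabulary sizes, and the way per-type embeddings are combined are illustrative assumptions rather than the paper's design:

```python
import torch
import torch.nn as nn

# Hypothetical token types and vocabulary sizes, for illustration only.
TYPE_VOCABS = {"pitch": 128, "duration": 64, "position": 32}

class TypeAwareEmbedding(nn.Module):
    """Embeds each token type with its own table, then merges per time step."""
    def __init__(self, d_model=256):
        super().__init__()
        self.emb = nn.ModuleDict(
            {t: nn.Embedding(v, d_model) for t, v in TYPE_VOCABS.items()})
        self.proj = nn.Linear(d_model * len(TYPE_VOCABS), d_model)

    def forward(self, tokens):
        # tokens: dict mapping each type name to a (batch, time) index tensor
        parts = [self.emb[t](tokens[t]) for t in TYPE_VOCABS]
        return self.proj(torch.cat(parts, dim=-1))  # (batch, time, d_model)
```

One motivation for such grouping is that co-occurring tokens of different types can share a time step, shortening the sequence relative to a single flat vocabulary.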
We present the Freesound Loop Dataset (FSLD), a new large-scale dataset of music loops annotated by experts.
A DJ mix is a sequence of music tracks concatenated seamlessly, typically rendered for audiences in a live setting by a DJ on stage.
Blind music source separation has been a popular and active subject of research in both the music information retrieval and signal processing communities.
Using this data, we investigate two types of model architectures for estimating the compatibility of loops: one based on a Siamese network, and the other a pure convolutional neural network (CNN).
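As a rough sketch of the two designs, assuming PyTorch and inputs shaped (batch, 1, n_mels, n_frames); the layer sizes and the cosine-similarity scoring are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoopEncoder(nn.Module):
    """Small CNN mapping a (1, n_mels, n_frames) spectrogram to an embedding."""
    def __init__(self, emb_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(32, emb_dim)

    def forward(self, x):
        return self.fc(self.conv(x))

class SiameseCompatibility(nn.Module):
    """Scores a loop pair by the similarity of shared-weight embeddings."""
    def __init__(self):
        super().__init__()
        self.encoder = LoopEncoder()

    def forward(self, loop_a, loop_b):
        return F.cosine_similarity(self.encoder(loop_a), self.encoder(loop_b))

class PairCNN(nn.Module):
    """Stacks the pair as two channels and predicts compatibility directly."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, loop_a, loop_b):
        return self.net(torch.cat([loop_a, loop_b], dim=1)).squeeze(1)
```

The Siamese variant learns a shared embedding space in which compatible loops lie close together, while the pair CNN can exploit cross-loop interactions directly at the cost of embedding reusability.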
Deep learning algorithms are increasingly developed for learning to compose music in the form of MIDI files.
This paper presents the Jazz Transformer, a generative model that utilizes a neural sequence model called the Transformer-XL for modeling lead sheets of Jazz music.
Specifically, given a speech input and, optionally, the F0 contour of the target singing, the proposed model generates a singing signal as output, using a progressive-growing encoder/decoder architecture and boundary equilibrium GAN (BEGAN) loss functions.
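For reference, the standard boundary equilibrium GAN objective (the paper may use a variant), where \( \mathcal{L}(v) = \lVert v - D(v) \rVert_1 \) is the reconstruction loss of the autoencoder-based discriminator:

```latex
\mathcal{L}_D = \mathcal{L}(x) - k_t\,\mathcal{L}\bigl(G(z_D)\bigr), \qquad
\mathcal{L}_G = \mathcal{L}\bigl(G(z_G)\bigr), \qquad
k_{t+1} = k_t + \lambda_k\bigl(\gamma\,\mathcal{L}(x) - \mathcal{L}(G(z_G))\bigr)
```

The control variable \( k_t \) keeps the generator and discriminator losses in balance during training, with \( \gamma \) setting the target ratio between them.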
Audio examples, as well as the code for implementing our model, will be publicly available online upon paper publication.
In this study, we examine whether we can analyze and compare Western and Chinese classical music based on soundscape models.
A singer identification model may learn to extract features unrelated to the voice from the instrumental parts of songs if a singer only sings in certain musical contexts (e.g., genres).
In contrast with this general approach, this paper shows that Transformers can do even better for music modeling, when we improve the way a musical score is converted into the data fed to a Transformer model.
Several prior works have proposed various methods for the task of automatic melody harmonization, in which a model aims to generate a sequence of chords to serve as the harmonic accompaniment of a given multiple-bar melody sequence.
Generative models for singing voice have been mostly concerned with the task of "singing voice synthesis," i.e., to produce singing voice waveforms given musical scores and text lyrics.
In this paper, we tackle the problem of transfer learning for automatic Jazz generation.
To reach information at remote locations, we propose to combine dilated convolution with a modified version of gated recurrent units (GRU) called the `Dilated GRU' to form a block.
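A hedged sketch of one way such a block could be wired, assuming PyTorch: a 1-D dilated convolution feeds a GRU whose recurrence connects step t to step t-d instead of t-1. The exact wiring and sizes are illustrative assumptions, not the published configuration:

```python
import torch
import torch.nn as nn

class DilatedGRU(nn.Module):
    def __init__(self, in_ch, hidden, dilation):
        super().__init__()
        self.dilation = dilation
        self.conv = nn.Conv1d(in_ch, hidden, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.cell = nn.GRUCell(hidden, hidden)

    def forward(self, x):                      # x: (batch, channels, time)
        feats = torch.relu(self.conv(x))       # dilated convolution features
        B, H, T = feats.shape
        d, hs = self.dilation, []
        for t in range(T):
            # The recurrence reaches back d steps, widening the receptive field.
            h_prev = hs[t - d] if t >= d else feats.new_zeros(B, H)
            hs.append(self.cell(feats[:, :, t], h_prev))
        return torch.stack(hs, dim=2)          # (batch, hidden, time)
```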
We investigate disentanglement techniques such as adversarial training to separate latent factors related to the musical content (pitch) of the different parts of a piece from those related to the instrumentation (timbre) of the parts per short-time segment.
We present collaborative similarity embedding (CSE), a unified framework that exploits comprehensive collaborative relations available in a user-item bipartite graph for representation learning and recommendation.
New machine learning algorithms are being developed to solve problems in different areas, including music.
In this paper, we aim to gain a deeper understanding of adversarial losses by decoupling the effects of their component functions and regularization terms.
In this paper, we introduce a novel attentional similarity module for the problem of few-shot sound recognition.
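One illustrative reading of the idea, sketched in PyTorch: frame-pair similarities between a query and a support example are pooled with learned attention weights rather than a fixed average. This is an assumption-laden sketch, not the paper's exact module:

```python
import torch
import torch.nn as nn

class AttentionalSimilarity(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.att = nn.Linear(dim, 1)   # scores how much each frame matters

    def forward(self, query, support):  # (B, Tq, D), (B, Ts, D)
        # Similarity between every query frame and every support frame.
        sim = torch.einsum('bqd,bsd->bqs', query, support)
        a_q = torch.softmax(self.att(query).squeeze(-1), dim=1)    # (B, Tq)
        a_s = torch.softmax(self.att(support).squeeze(-1), dim=1)  # (B, Ts)
        # Attention-weighted pooling of the frame-pair similarities.
        return torch.einsum('bq,bqs,bs->b', a_q, sim, a_s)
```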
To build such an AI performer, we propose in this paper a deep convolutional model that learns in an end-to-end manner the score-to-audio mapping between a symbolic representation of music, the piano roll, and an audio representation of music, the spectrogram.
Our experiments on both vocal melody extraction and general melody extraction validate the effectiveness of the proposed model.
We propose the BinaryGAN, a novel generative adversarial network (GAN) that uses binary neurons at the output layer of the generator.
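A minimal sketch of a binary neuron realized with the straight-through estimator, which is the standard way to keep gradients flowing through a hard binarization; details may differ from the paper's exact formulation:

```python
import torch

class BinaryNeuron(torch.autograd.Function):
    @staticmethod
    def forward(ctx, logits):
        probs = torch.sigmoid(logits)
        ctx.save_for_backward(probs)
        return (probs > 0.5).float()          # deterministic binarization

    @staticmethod
    def backward(ctx, grad_output):
        (probs,) = ctx.saved_tensors
        # Straight-through: back-propagate the smooth sigmoid gradient.
        return grad_output * probs * (1 - probs)

binarize = BinaryNeuron.apply
x = torch.randn(4, 8, requires_grad=True)
y = binarize(x)           # binary {0, 1} outputs, yet gradients still flow to x
y.sum().backward()
```

Because the forward pass is binary while the backward pass uses the smooth sigmoid gradient, the generator can be trained end-to-end despite its discrete outputs.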
A new recurrent convolutional generative model for the task is proposed, along with three new symbolic-domain harmonic features to facilitate learning from unpaired lead sheets and MIDIs.
Can we make a famous rap singer like Eminem sing any song we like?
In this work, we propose a denoising Auto-encoder with Recurrent skip Connections (ARC).
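A heavily hedged sketch of one plausible reading of "recurrent skip connections", assuming PyTorch and illustrative sizes: an encoder/decoder denoiser whose skip path runs through a GRU before rejoining the decoder. The actual ARC wiring may differ:

```python
import torch
import torch.nn as nn

class ARCSketch(nn.Module):
    def __init__(self, n_bins=512, hidden=256):
        super().__init__()
        self.enc = nn.Linear(n_bins, hidden)
        self.skip_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.dec = nn.Linear(hidden, n_bins)

    def forward(self, spec):                 # spec: (batch, time, n_bins)
        h = torch.relu(self.enc(spec))
        skip, _ = self.skip_rnn(h)           # the skip connection is itself recurrent
        return torch.sigmoid(self.dec(h + skip))  # denoised output / mask
```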
Vector-valued neural learning has emerged as a promising direction in deep learning recently.
Instrument playing is among the most common scenes in music-related videos, which nowadays represent one of the largest sources of online video.
Experimental results show that using binary neurons instead of hard thresholding (HT) or Bernoulli sampling (BS) indeed leads to better results in a number of objective measures.
In a previous work, we introduced an attention-based convolutional recurrent neural network that uses music emotion classification as a surrogate task for music highlight extraction, for Pop songs.
Singing voice separation, a fundamental problem in music information retrieval, attempts to separate the vocal and instrumental parts of a music recording.
Informed by recent work on tensor singular value decomposition and circulant algebra matrices, this paper presents a new theoretical bridge that unifies the hypercomplex and tensor-based approaches to singular value decomposition and robust principal component analysis.
Thus, in this letter, we extend principal component pursuit to the complex and quaternionic cases to account for the missing phase information.
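For context, the standard real-valued principal component pursuit program being extended; in the complex and quaternionic cases the entrywise term becomes a sum of complex (respectively quaternion) moduli, which is what lets the phase be retained. This is the textbook objective, not necessarily the letter's exact formulation:

```latex
\min_{L,\,S}\ \lVert L \rVert_{*} + \lambda \lVert S \rVert_{1}
\quad \text{subject to} \quad L + S = M,
\qquad
\lVert S \rVert_{1} = \sum_{i,j} \lvert S_{ij} \rvert
```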
Many existing methods adopt a uniform sampling method to reduce learning complexity, but when the network is non-uniform (i.e., a weighted network) such uniform sampling incurs information loss.
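A minimal sketch of the standard alternative, drawing edges with probability proportional to their weights; the variable names and toy data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [("u1", "i1", 5.0), ("u1", "i2", 1.0), ("u2", "i2", 2.0)]
weights = np.array([w for _, _, w in edges])
probs = weights / weights.sum()

# Sample training edges proportionally to weight instead of uniformly,
# so strongly weighted relations are seen more often.
idx = rng.choice(len(edges), size=4, p=probs)
batch = [edges[i] for i in idx]
```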
A model for hit song prediction can be used in the pop music industry to identify emerging trends and potential artists or songs before they are marketed to the public.
The three models, which differ in the underlying assumptions and accordingly the network architectures, are referred to as the jamming model, the composer model and the hybrid model.
Being able to predict whether a song can be a hit has important applications in the music industry.
We conduct a user study to compare the eight-bar melodies generated by MidiNet and by Google's MelodyRNN models, each time using the same priming melody.
Recent years have witnessed an increased interest in the application of persistent homology, a topological tool for data analysis, to machine learning problems.