We introduce a convolutional recurrent neural network (CRNN) for music tagging.
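The CRNN idea can be sketched as a convolutional front end that extracts local timbral features from a mel-spectrogram, followed by a recurrent layer that summarizes them over time, ending in independent sigmoid outputs for multi-label tagging. The sketch below is a minimal illustration with random weights; all sizes (96 mel bins, 32 filters, 64 hidden units, the 50-tag output common to MagnaTagATune benchmarks) are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_mels, n_frames = 96, 100   # input mel-spectrogram size (assumed)
n_filters, kernel = 32, 5    # convolutional front end (assumed sizes)
n_hidden = 64                # recurrent summarizer (assumed size)
n_tags = 50                  # MagnaTagATune is commonly evaluated on 50 tags

X = rng.standard_normal((n_mels, n_frames))  # stand-in mel-spectrogram

# --- convolutional stage: 1-D conv over time spanning all mel bins ---
W_conv = rng.standard_normal((n_filters, n_mels, kernel)) * 0.01
T_out = n_frames - kernel + 1
feat = np.empty((n_filters, T_out))
for t in range(T_out):
    # each filter sees a (n_mels x kernel) patch of the spectrogram
    feat[:, t] = np.einsum('fmk,mk->f', W_conv, X[:, t:t + kernel])
feat = np.maximum(feat, 0.0)  # ReLU

# --- recurrent stage: plain tanh RNN over the conv feature sequence ---
# (real CRNNs typically use GRUs; a vanilla RNN keeps the sketch short)
W_x = rng.standard_normal((n_hidden, n_filters)) * 0.01
W_h = rng.standard_normal((n_hidden, n_hidden)) * 0.01
h = np.zeros(n_hidden)
for t in range(T_out):
    h = np.tanh(W_x @ feat[:, t] + W_h @ h)

# --- tag head: one independent sigmoid per tag (multi-label output) ---
W_out = rng.standard_normal((n_tags, n_hidden)) * 0.01
probs = 1.0 / (1.0 + np.exp(-(W_out @ h)))
```

The sigmoid (rather than softmax) head matters here: music tags are not mutually exclusive, so each tag gets its own independent probability.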
Lastly, we show that self-supervised pre-training allows us to learn efficiently on smaller labeled datasets: we still achieve a score of 33.1% despite using only 259 labeled songs during fine-tuning.
Ranked #1 on Music Auto-Tagging on MagnaTagATune (ROC AUC metric)
Recently, the end-to-end approach that learns hierarchical representations from raw data using deep convolutional neural networks has been successfully explored in the image, text and speech domains.
However, the MIR field is still dominated by variants of the classical VGG-style CNN architecture, often combined with more complex modules such as attention and/or techniques such as pre-training on large datasets.
While applications of transfer learning are common in the fields of computer vision and natural language processing, audio and speech processing surprisingly lack readily available, transferable models.
Deep convolutional neural networks (CNNs) have been actively adopted in the field of music information retrieval, e.g., genre classification, mood detection, and chord recognition.
Music tags, the words used to describe music audio in text, have different levels of abstraction.