Codified audio language modeling learns useful representations for music information retrieval

12 Jul 2021 · Rodrigo Castellon, Chris Donahue, Percy Liang

We demonstrate that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks. Specifically, we explore representations from Jukebox (Dhariwal et al. 2020): a music generation system containing a language model trained on codified audio from 1M songs. To determine if Jukebox's representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks. Relative to representations from conventional MIR models which are pre-trained on tagging, we find that using representations from Jukebox as input features yields 30% stronger performance on average across four MIR tasks: tagging, genre classification, emotion recognition, and key detection. For key detection, we observe that representations from Jukebox are considerably stronger than those from models pre-trained on tagging, suggesting that pre-training via codified audio language modeling may address blind spots in conventional approaches. We interpret the strength of Jukebox's representations as evidence that modeling audio instead of tags provides richer representations for MIR.
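The probing setup described above (frozen representations used as input features to a shallow model) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the synthetic frames from the hypothetical `make_clip` helper stand in for real Jukebox activations, mean pooling over time is an assumed way to get one vector per clip, and the softmax-regression probe is just one possible shallow model.

```python
import math
import random


def mean_pool(frames):
    """Average per-frame feature vectors into one clip-level vector (assumed pooling)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores(weights, biases, x):
    return [sum(w * xi for w, xi in zip(wc, x)) + b for wc, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a softmax-regression probe on frozen features with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(scores(weights, biases, x))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for d in range(dim):
                    grad_w[c][d] += err * x[d]
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / n
            for d in range(dim):
                weights[c][d] -= lr * grad_w[c][d] / n
    return weights, biases


def predict(weights, biases, x):
    s = scores(weights, biases, x)
    return max(range(len(s)), key=s.__getitem__)


def make_clip(label, rng, num_frames=10, dim=4):
    """Hypothetical stand-in for frame-level activations of a pre-trained model."""
    return [[(1.0 if d == label else 0.0) + rng.gauss(0.0, 0.3) for d in range(dim)]
            for _ in range(num_frames)]


rng = random.Random(0)
train_labels = [i % 2 for i in range(40)]
test_labels = [i % 2 for i in range(40)]
train_x = [mean_pool(make_clip(y, rng)) for y in train_labels]
test_x = [mean_pool(make_clip(y, rng)) for y in test_labels]

# Only the probe is trained; the (simulated) representations stay fixed.
weights, biases = train_linear_probe(train_x, train_labels, num_classes=2)
accuracy = sum(predict(weights, biases, x) == y
               for x, y in zip(test_x, test_labels)) / len(test_labels)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

Because the probe has so little capacity, its held-out score mostly reflects how much task-relevant information the frozen features already carry, which is the property this kind of evaluation is meant to measure.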

Results from the Paper

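| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Emotion Recognition | Emomusic | Jukebox (Pre-training: CALM) | EmoA | 72.1 | # 1 |
| Emotion Recognition | Emomusic | Jukebox (Pre-training: CALM) | EmoV | 61.7 | # 1 |
| Emotion Recognition | Emomusic | CLMR (Pre-training: contrastive) | EmoA | 67.8 | # 2 |
| Emotion Recognition | Emomusic | CLMR (Pre-training: contrastive) | EmoV | 45.8 | # 2 |
| Key Detection | Giantsteps | Jukebox (Pre-training: CALM) | Accuracy | 66.7 | # 2 |
| Key Detection | Giantsteps | CLMR (Pre-training: contrastive) | Accuracy | 14.9 | # 3 |
| Music Genre Classification | GTZAN | Jukebox (Pre-training: CALM) | Accuracy | 79.7 | # 3 |
| Music Genre Classification | GTZAN | CLMR (Pre-training: contrastive) | Accuracy | 68.6 | # 4 |
| Music Tagging | MagnaTagATune | CLMR (Pre-training: contrastive) | MTT_AUC | 89.4 | # 2 |
| Music Tagging | MagnaTagATune | CLMR (Pre-training: contrastive) | MTT_AP | 36.1 | # 2 |
| Music Tagging | MagnaTagATune | Jukebox (Pre-training: CALM) | MTT_AUC | 91.5 | # 1 |
| Music Tagging | MagnaTagATune | Jukebox (Pre-training: CALM) | MTT_AP | 41.4 | # 1 |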
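
As a quick reading aid, the per-metric gaps between the two models listed above can be computed directly from the reported values. Note that this is not the 30% average improvement quoted in the abstract: that figure compares Jukebox against models pre-trained on tagging, whereas CLMR is pre-trained contrastively. The dictionary keys below are labels chosen for this sketch.

```python
# Reported values copied from the results listed above: (Jukebox, CLMR).
results = {
    "Emotion Recognition / EmoA": (72.1, 67.8),
    "Emotion Recognition / EmoV": (61.7, 45.8),
    "Key Detection / Accuracy": (66.7, 14.9),
    "Music Genre Classification / Accuracy": (79.7, 68.6),
    "Music Tagging / MTT_AUC": (91.5, 89.4),
    "Music Tagging / MTT_AP": (41.4, 36.1),
}

# Absolute gap in metric points, rounded to the precision of the source values.
gaps = {name: round(jukebox - clmr, 1) for name, (jukebox, clmr) in results.items()}
largest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: +{gap}")
print(f"largest gap: {largest}")
```

The gap is widest on key detection, in line with the abstract's observation that this task is where Jukebox's representations stand out most.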