MUSAN is a corpus of music, speech and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises.
194 PAPERS • NO BENCHMARKS YET
The MAESTRO dataset contains over 200 hours of paired audio and MIDI recordings from ten years of International Piano-e-Competition. The MIDI data includes key strike velocities and sustain/sostenuto/una corda pedal positions. Audio and MIDI files are aligned with ∼3 ms accuracy and sliced to individual musical pieces, which are annotated with composer, title, and year of performance. Uncompressed audio is of CD quality or higher (44.1–48 kHz 16-bit PCM stereo).
108 PAPERS • 1 BENCHMARK
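A minimal sketch of inspecting the paired MIDI data in MAESTRO, assuming a local copy of one of its MIDI files and the third-party pretty_midi library; the filename is a placeholder.

```python
# Sketch: read key-strike velocities and sustain-pedal events from a MAESTRO MIDI file.
import pretty_midi

midi = pretty_midi.PrettyMIDI("maestro_example.midi")  # hypothetical filename

for instrument in midi.instruments:
    # Key-strike velocities are stored per note (0-127).
    velocities = [note.velocity for note in instrument.notes]
    # Sustain-pedal positions arrive as control-change 64 messages.
    sustain = [(cc.time, cc.value) for cc in instrument.control_changes
               if cc.number == 64]
    print(len(instrument.notes), "notes,", len(sustain), "sustain events")
```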
MusicCaps is a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts. For each 10-second music clip, MusicCaps provides:
69 PAPERS • 1 BENCHMARK
MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.
41 PAPERS • 1 BENCHMARK
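A hedged sketch of reading one MusicNet label file with pandas; the path, column names, and sample-rate convention below are assumptions inferred from the description above, not guaranteed by the dataset's documentation.

```python
import pandas as pd

SAMPLE_RATE = 44100  # assumed audio sample rate

labels = pd.read_csv("musicnet_labels/2303.csv")  # hypothetical path
# Assumed columns: start_time, end_time, instrument, note, start_beat,
# end_beat, note_value, with note times given in audio samples.
labels["start_sec"] = labels["start_time"] / SAMPLE_RATE
labels["end_sec"] = labels["end_time"] / SAMPLE_RATE
print(labels[["start_sec", "end_sec", "instrument", "note"]].head())
```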
The MTG-Jamendo dataset is an open dataset for music auto-tagging. The dataset contains over 55,000 full audio tracks with 195 tag categories (87 genre tags, 40 instrument tags, and 56 mood/theme tags). It is built using music available at Jamendo under Creative Commons licenses and tags provided by content uploaders. All audio is distributed in 320 kbps MP3 format.
35 PAPERS • NO BENCHMARKS YET
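A small sketch of turning per-track tag lists (genre / instrument / mood-theme) into binary target vectors for auto-tagging, using scikit-learn; the example tags are illustrative, not taken from MTG-Jamendo's actual vocabulary.

```python
from sklearn.preprocessing import MultiLabelBinarizer

track_tags = [
    ["genre---rock", "instrument---guitar"],       # illustrative tag strings
    ["genre---ambient", "mood/theme---calm"],
]
mlb = MultiLabelBinarizer()
targets = mlb.fit_transform(track_tags)  # shape: (n_tracks, n_distinct_tags)
print(mlb.classes_, targets)
```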
The JSB chorales are a set of short, four-voice pieces of music well-noted for their stylistic homogeneity. The chorales were originally composed by Johann Sebastian Bach in the 18th century. He wrote them by first taking pre-existing melodies from contemporary Lutheran hymns and then harmonising them to create the parts for the remaining three voices. The version of the dataset used canonically in representation learning contexts consists of 382 such chorales, with a train/validation/test split of 229, 76 and 77 samples respectively.
32 PAPERS • 1 BENCHMARK
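A hedged sketch of the data layout commonly used for the 382-chorale version of the JSB chorales in representation-learning work: a pickled dict with train/validation/test splits, each a list of pieces, where each piece is a list of chords given as tuples of MIDI pitches. The filename and exact key names are assumptions.

```python
import pickle

with open("jsb_chorales.pkl", "rb") as f:  # hypothetical path
    data = pickle.load(f)

train, valid, test = data["train"], data["valid"], data["test"]
print(len(train), len(valid), len(test))  # expected: 229, 76, 77
first_chord = train[0][0]                 # e.g. a tuple of 4 MIDI pitches
print(first_chord)
```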
ASAP is a dataset of 222 digital musical scores aligned with 1068 performances (more than 92 hours) of Western classical piano music.
11 PAPERS • 2 BENCHMARKS
The Lakh Pianoroll Dataset (LPD) is a collection of 174,154 multitrack pianorolls derived from the Lakh MIDI Dataset (LMD).
9 PAPERS • NO BENCHMARKS YET
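A minimal sketch of loading one LPD pianoroll with the third-party pypianoroll library, assuming the multitrack .npz format; the path is a placeholder.

```python
import pypianoroll

multitrack = pypianoroll.load("lpd_example.npz")  # hypothetical path
for track in multitrack.tracks:
    # Each track holds a (time_steps, 128) pianoroll matrix.
    print(track.name, track.program, track.is_drum, track.pianoroll.shape)
```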
The first large-scale symphony generation dataset.
9 PAPERS • 1 BENCHMARK
AIOZ-GDANCE comprises 16.7 hours of whole-body motion and music audio of group dancing. Each video in the dataset ranges from 15 to 60 seconds in duration.
7 PAPERS • 1 BENCHMARK
The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices have acoustic and linguistic characteristics that differ from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset, consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/val/test split in which the test sets cover various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances.
6 PAPERS • NO BENCHMARKS YET
ATEPP is a dataset of expressive piano performances by virtuoso pianists. The dataset contains 11,677 performances (~1,000 hours) by 49 pianists and covers 1,580 movements by 25 composers. All of the MIDI files in the dataset come from piano transcriptions of existing audio recordings of piano performances. Scores in MusicXML format are also available for around half of the tracks. The dataset is organized and aligned by compositions and movements for comparative studies.
5 PAPERS • NO BENCHMARKS YET
The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation and music-language retrieval.
5 PAPERS • 1 BENCHMARK
ComMU has 11,144 MIDI samples consisting of short note sequences created by professional composers, each paired with 12 items of metadata. The dataset is designed for a new task, combinatorial music generation, which generates diverse, high-quality music from metadata alone using an auto-regressive language model.
4 PAPERS • NO BENCHMARKS YET
For each dataset we provide a short description as well as several characterization metrics: the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average imbalance ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep), and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is cardinality divided by the number of labels. Diversity is the percentage of labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance across all labels: the greater the avgIR, the more imbalanced the dataset. Finally, rDep measures the proportion of label pairs that are dependent at 99% confidence.
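A minimal sketch of how these metrics could be computed from a binary label matrix, following one reading of the definitions above; rDep, which requires chi-square tests over label pairs, is omitted, and the toy matrix is illustrative.

```python
import numpy as np

def characterize(Y: np.ndarray, d: int):
    """Compute Card, Dens, Div, avgIR and complexity for a binary label matrix Y (m x q)."""
    m, q = Y.shape
    card = Y.sum(axis=1).mean()              # average labels per instance
    dens = card / q                          # cardinality / number of labels
    labelsets = {tuple(row) for row in Y}    # distinct label combinations observed
    div = len(labelsets) / (2 ** q)          # observed / possible labelsets
    pos = Y.sum(axis=0)                      # positive count per label
    ir = pos.max() / np.maximum(pos, 1)      # imbalance ratio per label
    avg_ir = ir.mean()
    complexity = m * q * d                   # m x q x d, as in [Read 2010]
    return card, dens, div, avg_ir, complexity

Y = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 0]])
print(characterize(Y, d=10))
```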
A large-scale and diverse duet interactive dance dataset, recording about 117 minutes of professional dancers' performances.
3 PAPERS • NO BENCHMARKS YET
Giantsteps is a dataset that includes songs in major and minor keys for all 12 pitch classes, i.e., a 24-way key classification task.
3 PAPERS • 1 BENCHMARK
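A small sketch of mapping a key label such as "f# minor" to one of Giantsteps' 24 classes (12 pitch classes × {major, minor}); the label format and the major-first class ordering are assumptions for illustration, not the dataset's exact annotation syntax.

```python
PITCH_CLASSES = ["c", "c#", "d", "d#", "e", "f", "f#", "g", "g#", "a", "a#", "b"]

def key_to_class(label: str) -> int:
    """Map a 'tonic mode' key string to an index in 0..23 (majors 0-11, minors 12-23)."""
    tonic, mode = label.lower().split()
    # Note: flat spellings (e.g. "db") would need enharmonic normalization first.
    return PITCH_CLASSES.index(tonic) + (12 if mode == "minor" else 0)

print(key_to_class("f# minor"))  # -> 18
```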
A MIDI dataset of 500 four-part chorales generated by the KS_Chorus algorithm, annotated with results from hundreds of listening-test participants, plus a further 500 unannotated chorales.
The MusicBench dataset is a music audio-text pair dataset designed for text-to-music generation and released along with the Mustango text-to-music model. MusicBench is based on the MusicCaps dataset, which it expands from 5,521 samples to 52,768 training and 400 test samples.
This dataset includes all music sources, background noises, impulse response (IR) samples, and conversational speech used in the work "Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning" (ICASSP 2021, https://arxiv.org/abs/2010.11910).
2 PAPERS • NO BENCHMARKS YET
The scarcity of high-quality, multi-task singing datasets significantly hinders the development of diverse, controllable, and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. In particular: (1) we collect 80.59 hours of high-quality songs, forming the largest recorded singing dataset; (2) 20 professional singers across nine languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition.
The Image-MusicEmotion-Matching-Net (IMEMNet) dataset is a dataset for continuous emotion-based image and music matching. It has over 140K image-music pairs.
JamALT is a revision of the JamendoLyrics dataset (80 songs in 4 languages), adapted for use as an automatic lyrics transcription (ALT) benchmark.
2 PAPERS • 5 BENCHMARKS
Lyra is a dataset of 1570 traditional and folk Greek music pieces that includes audio and video (timestamps and links to YouTube videos), along with annotations that describe aspects of particular interest for this dataset, including instrumentation, geographic information and labels of genre and subgenre, among others.
The YouTube8M-MusicTextClips dataset consists of over 4k high-quality human text descriptions of music found in video clips from the YouTube8M dataset.
The AIME dataset contains 6,000 audio tracks generated by 12 music generation models in addition to 500 tracks from MTG-Jamendo. The prompts used to generate music are combinations of representative and diverse tags from the MTG-Jamendo dataset.
1 PAPER • NO BENCHMARKS YET
We propose a dataset, AVASpeech-SMAD, to assist speech and music activity detection research. With frame-level music labels, the proposed dataset extends the existing AVASpeech dataset, which originally consists of 45 hours of audio and speech activity labels. To the best of our knowledge, the proposed AVASpeech-SMAD is the first open-source dataset that features strong polyphonic labels for both music and speech. The dataset was manually annotated and verified via an iterative cross-checking process. A simple automatic examination was also implemented to further improve the quality of the labels. Evaluation results from two state-of-the-art SMAD systems are also provided as a benchmark for future reference.
ChMusic is a traditional Chinese music dataset for training models and evaluating performance on musical instrument recognition. The dataset covers 11 musical instruments: Erhu, Pipa, Sanxian, Dizi, Suona, Zhuiqin, Zhongruan, Liuqin, Guzheng, Yangqin, and Sheng.
We construct a large-scale conducting motion dataset, named ConductorMotion100, by deploying pose estimation on conductor view videos of concert performance recordings collected from online video platforms. The construction of ConductorMotion100 removes the need for expensive motion-capture equipment and makes full use of massive online video resources. As a result, the scale of ConductorMotion100 has reached an unprecedented length of 100 hours.
Dizi is a dataset of music in the styles of the Northern and Southern schools. Characteristics of the two styles, including melody and playing techniques, are deconstructed.
1,000 songs were selected from the Free Music Archive (FMA). The annotated excerpts are available in the same package (song IDs 1 to 1000). Some redundancies were identified, which reduced the dataset to 744 songs. The dataset is split into a development set (619 songs) and an evaluation set (125 songs). The extracted 45-second excerpts are all re-encoded to the same sampling frequency, i.e., 44,100 Hz.
1 PAPER • 1 BENCHMARK
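A minimal sketch of loading one of the 45-second excerpts described above at the stated 44,100 Hz sampling rate using librosa; the filename is a placeholder.

```python
import librosa

# Load a single excerpt, resampled to 44.1 kHz and truncated to 45 seconds.
audio, sr = librosa.load("excerpt_0001.mp3", sr=44100, mono=True, duration=45.0)
print(audio.shape, sr)
```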
48 multitrack jazz recordings with many annotations.
1 PAPER • 2 BENCHMARKS
The Haydn Annotation Dataset consists of note onset annotations from 24 experiment participants with varying musical experience. The annotation experiments use recordings from the ARME Virtuoso Strings Dataset.
This dataset contains annotations for 5000 music files on the following music properties:
A dataset of music videos with continuous valence/arousal ratings as well as emotion tags.
Nlakh is a dataset for Musical Instrument Retrieval. It is a combination of the NSynth dataset, which provides a large number of instruments, and the Lakh dataset, which provides multi-track MIDI data.
The SynthSOD dataset contains more than 47 hours of multitrack music obtained by synthesizing orchestra and ensemble pieces from the Symbolic Orchestral Database (SOD) using the Spitfire BBC Symphony Orchestra Professional library. To synthesize the MIDI files from the SOD, we needed to adapt the original files to the General MIDI standard, select a subset of files that fit our requirements (e.g., containing only instruments that we could synthesize), and develop a new system to generate musically motivated random annotations of tempo, dynamics, and articulation.
Virtuoso Strings is a dataset for soft onset detection for string instruments. It consists of over 144 recordings of professional performances of an excerpt from Haydn's string quartet Op. 74 No. 1 Finale, each with corresponding individual instrument onset annotations.
YM2413-MDB is an 80s FM video game music dataset with multi-label emotion annotations. It includes 669 audio and MIDI files of music from Sega and MSX PC games of the 1980s composed for the YM2413, a programmable sound generator based on FM synthesis. The collected game music is arranged with a subset of 15 monophonic instruments and one drum instrument.
InaGVAD is a Voice Activity Detection (VAD) and Speaker Gender Segmentation (SGS) dataset designed to represent the acoustic diversity of French TV and radio programs. A detailed description of InaGVAD, together with a benchmark of 6 freely available VAD systems and 3 SGS systems, is provided in a paper presented at LREC-COLING 2024.
jaCappella is a corpus of Japanese a cappella vocal ensembles (jaCappella corpus) for vocal ensemble separation and synthesis. It consists of 35 copyright-cleared vocal ensemble songs and audio recordings of their individual voice parts. These songs were arranged from out-of-copyright Japanese children's songs and have six voice parts (lead vocal, soprano, alto, tenor, bass, and vocal percussion). They are divided into seven subsets, each of which features typical characteristics of a music genre such as jazz or enka.
This publicly available data is synthesised audio for woodwind quartets, including renderings of each instrument in isolation. The data was created to be used as training data within Cadenza's second open machine learning challenge (CAD2) for the task of rebalancing classical music ensembles. The dataset is also intended for developing other music information retrieval (MIR) algorithms using machine learning. It was created because of the lack of large-scale datasets of classical woodwind music with separate audio for each instrument and a permissive license for reuse. Music scores were selected from the OpenScore String Quartet corpus. These were rendered for two woodwind ensembles of (i) flute, oboe, clarinet and bassoon; and (ii) flute, oboe, alto saxophone and bassoon. This was done by a professional music producer using industry-standard software. Virtual instruments were used to create the audio for each instrument, using software that interpreted expression markings in the score.
0 PAPERS • NO BENCHMARKS YET