YouTube8M-MusicTextClips

The YouTube8M-MusicTextClips dataset consists of over 4k high-quality human text descriptions of music found in video clips from the YouTube8M dataset.

For each selected YouTube music video, we extracted a 10-second clip from the middle of the video for annotation. Annotators were given only the audio from this clip, so the text annotations describe the audio alone, not the visual content of the clip.
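As a rough illustration of the clip selection described above, the sketch below computes the start and end times of a 10-second window centred on a video's midpoint. The function name `middle_clip_window` is hypothetical, and the handling of videos shorter than 10 seconds is an assumption; the dataset's actual extraction code may differ.

```python
def middle_clip_window(duration_s: float, clip_len_s: float = 10.0) -> tuple[float, float]:
    """Return (start, end) in seconds of a clip centred on the video's midpoint."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Assumption: a video shorter than the clip length is used in full.
    if duration_s <= clip_len_s:
        return 0.0, duration_s
    # Centre the window: leave equal amounts of video before and after it.
    start = (duration_s - clip_len_s) / 2.0
    return start, start + clip_len_s


# A 4-minute (240 s) video yields the window from 115 s to 125 s.
print(middle_clip_window(240.0))  # (115.0, 125.0)
```

The resulting timestamps could then be passed to whatever audio tool is used to cut the clip.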

The dataset annotations are divided into train and test split files. As the dataset is meant mainly for evaluation, there are 3169 annotated clips from the test set and only 1000 annotated clips from the train set.
