Self-Supervised Contrastive Learning for Singing Voices

This study introduces self-supervised contrastive learning to acquire feature representations of singing voices. To acquire robust representations in an unsupervised manner, regular self-supervised contrastive learning trains neural networks to make the feature representation of a sample close to those of its computationally transformed versions. We likewise employ two transformations, pitch shifting and time stretching, chosen for the nature of singing voices, but we use them in the opposite way: we train the networks to push away the representations of the transformed versions. The networks thus learn to discriminate the changes in vocal timbre introduced by pitch shifting (without time stretching) and the changes in singing expression introduced by time stretching (without pitch shifting). Consequently, the acquired representations become attentive to vocal timbre and singing expression. This was confirmed through a singer identification task, where we trained a classifier to learn the relationship between the feature representations and the corresponding singer labels of 500 singers. As a result, the employed transformations helped the classifier improve the classification accuracy by 9.12 percentage points (top-1 accuracy: 63.08%) compared with the case where the feature representations fed to the classifier were acquired without the transformations (top-1 accuracy: 53.96%). Furthermore, the proposed approach can be extended to acquire feature representations attentive to either vocal timbre or singing expression, but not to the other, by changing how the transformations are incorporated. We further examined the characteristics of such vocal timbre- or singing expression-oriented feature representations with respect to song genre, singer gender, and vocal technique, and confirmed that they capture different aspects of singing voices.
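To illustrate the idea of pushing transformed versions away rather than pulling them close, here is a minimal PyTorch sketch (not the authors' code) of an InfoNCE-style loss in which embeddings of pitch-shifted and time-stretched excerpts serve as negatives while another unaltered view of the same excerpt serves as the positive. The names `encoder`, `augment`, `pitch_shift`, and `time_stretch` are hypothetical placeholders, and the pairing with a benign positive view is an assumption about how such a loss could be arranged, not a detail stated in the abstract.

```python
import torch
import torch.nn.functional as F


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: pull `positive` toward `anchor`, push `negatives` away.

    anchor, positive: (batch, dim); negatives: (batch, n_neg, dim).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    sim_pos = (anchor * positive).sum(-1, keepdim=True)      # (batch, 1)
    sim_neg = torch.einsum("bd,bnd->bn", anchor, negatives)  # (batch, n_neg)
    logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature

    # The positive sits at index 0 of each row, so the target class is 0.
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)


def training_step(encoder, x, augment, pitch_shift, time_stretch):
    """One hypothetical step: the transformed versions are treated as negatives."""
    z_anchor  = encoder(augment(x))       # one benign view of each excerpt
    z_pos     = encoder(augment(x))       # another benign view -> positive
    z_shift   = encoder(pitch_shift(x))   # altered vocal timbre -> negative
    z_stretch = encoder(time_stretch(x))  # altered singing expression -> negative

    negatives = torch.stack([z_shift, z_stretch], dim=1)     # (batch, 2, dim)
    return info_nce(z_anchor, z_pos, negatives)
```

In standard SimCLR-style training the pitch-shifted and time-stretched views would instead be placed in the positive slot; swapping them into the negatives is what makes the learned representations sensitive to vocal timbre and singing expression.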




