ATST: Audio Representation Learning with Teacher-Student Transformer

26 Apr 2022  ·  Xian Li, Xiaofei Li ·

Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data and then transfers that knowledge to a specific problem with limited labeled data. SSL has achieved promising results in various domains. This work addresses the problem of segment-level general audio SSL and proposes a new transformer-based teacher-student SSL model, named ATST. A transformer encoder is developed on top of a recently emerged teacher-student baseline scheme, which substantially improves the modeling capability of pre-training. In addition, a new strategy for positive-pair creation is designed to fully leverage the capability of the transformer. Extensive experiments have been conducted, and the proposed model achieves new state-of-the-art results on almost all of the downstream tasks.
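The teacher-student scheme described above can be sketched as follows. This is a minimal BYOL-style illustration of the general idea (an EMA teacher, a positive pair formed from two augmented segments of the same clip, and a cosine-similarity objective), not the paper's actual architecture; all module names, shapes, and hyperparameters below are illustrative assumptions.

```python
# Hedged sketch of one teacher-student SSL training step.
# The Encoder stands in for the transformer encoder; everything here is
# an assumption for illustration, not ATST's actual implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Stand-in for the transformer encoder (a linear layer for brevity)."""
    def __init__(self, dim_in=64, dim_out=32):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, x):
        return self.proj(x)

def ema_update(teacher, student, m=0.99):
    # The teacher's weights track the student via an exponential moving average.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)

def ssl_step(student, teacher, predictor, view_a, view_b, opt):
    # Positive pair: two differently augmented segments of the same audio clip.
    z_s = predictor(student(view_a))   # student branch with predictor head
    with torch.no_grad():
        z_t = teacher(view_b)          # teacher branch, no gradient
    # Maximize cosine similarity between the two views' embeddings.
    loss = -F.cosine_similarity(z_s, z_t, dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, student)
    return loss.item()

student = Encoder()
teacher = copy.deepcopy(student)
predictor = nn.Linear(32, 32)
opt = torch.optim.SGD(
    list(student.parameters()) + list(predictor.parameters()), lr=0.1
)

# Random stand-ins for two augmented "views" of a batch of audio features.
view_a, view_b = torch.randn(8, 64), torch.randn(8, 64)
loss = ssl_step(student, teacher, predictor, view_a, view_b, opt)
```

Only the student (and predictor) receive gradients; the teacher is updated purely by EMA, which is what keeps the two branches from collapsing to a trivial solution in this family of methods.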

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Audio Classification | Balanced Audio Set | Base (ours) | Mean AP | 37.4 | #2 |
| Spoken Command Recognition | Speech Command v2 | Base (ours) | Accuracy | 98.0 | #2 |
| Speaker Identification | VoxCeleb1 | ATST Base (ours) | Top-1 (%) | 94.3 | #4 |
| Speaker Identification | VoxCeleb1 | ATST Base (ours) | Accuracy | 94.3 | #4 |