AST: Audio Spectrogram Transformer

5 Apr 2021 · Yuan Gong, Yu-An Chung, James Glass

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
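The architecture described above maps onto a short PyTorch sketch. The code below is a minimal, assumed implementation, not the authors' released model: the class name and hyperparameter defaults are illustrative, and it uses non-overlapping patches, whereas the paper splits the spectrogram into overlapping 16×16 patches (stride 10 in both time and frequency) and initializes from ImageNet-pretrained DeiT weights, which this sketch omits.

```python
import torch
import torch.nn as nn


class AudioSpectrogramTransformer(nn.Module):
    """Sketch of an AST-style model: patchify the spectrogram, linearly
    embed each patch, prepend a [CLS] token, run a standard Transformer
    encoder, and classify from the [CLS] output."""

    def __init__(self, n_mels=128, n_frames=1024, patch=16,
                 dim=768, depth=12, heads=12, num_classes=527):
        super().__init__()
        # A strided Conv2d acts as a linear projection of flattened
        # non-overlapping patches (the paper uses overlapping patches).
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        n_patches = (n_mels // patch) * (n_frames // patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, spec):                     # spec: (B, n_mels, n_frames)
        x = self.patch_embed(spec.unsqueeze(1))  # (B, dim, H', W')
        x = x.flatten(2).transpose(1, 2)         # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                # logits from the [CLS] token


model = AudioSpectrogramTransformer()
logits = model(torch.randn(2, 128, 1024))        # two 10 s log-mel clips
print(logits.shape)                              # torch.Size([2, 527])
```

Note that the sequence is much longer than in a typical vision transformer: with the paper's overlapping stride-10 patches, a 10 s AudioSet clip yields 1212 patch tokens, versus the 512 non-overlapping patches in this sketch.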

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Audio Classification | AudioSet | AST (Ensemble) | Test mAP | 0.485 | #18 |
| Audio Tagging | AudioSet | Audio Spectrogram Transformer | Mean average precision | 0.485 | #5 |
| Audio Classification | AudioSet | AST (Single) | Test mAP | 0.459 | #35 |
| Speech Emotion Recognition | CREMA-D | ViT | Accuracy | 67.81 | #7 |
| Audio Classification | ESC-50 | Audio Spectrogram Transformer | Top-1 Accuracy | 95.7 | #14 |
| Audio Classification | ESC-50 | Audio Spectrogram Transformer | Accuracy (5-fold) | 95.7 | #14 |
| Audio Classification | ESC-50 | Audio Spectrogram Transformer | Pre-training dataset | AudioSet, ImageNet | #1 |
| Keyword Spotting | Google Speech Commands | Audio Spectrogram Transformer | Accuracy (V2, 35 commands) | 98.11 | #5 |
| Audio Classification | Speech Commands | AST-S | Accuracy | 98.11 ± 0.05 | #2 |
| Time Series Analysis | Speech Commands | ViT | % Test Accuracy | 98.11 | #2 |
