SubSpectral Normalization for Neural Audio Data Processing

25 Mar 2021  ·  Simyung Chang, Hyoungwoo Park, Janghoon Cho, Hyunsin Park, Sungrack Yun, Kyuwoong Hwang

Convolutional Neural Networks are widely used in various machine learning domains. In image processing, the features can be obtained by applying 2D convolution to all spatial dimensions of the input... However, in the audio case, frequency domain input like Mel-Spectrogram has different and unique characteristics in the frequency dimension. Thus, there is a need for a method that allows the 2D convolution layer to handle the frequency dimension differently. In this work, we introduce SubSpectral Normalization (SSN), which splits the input frequency dimension into several groups (sub-bands) and performs a different normalization for each group. SSN also includes an affine transformation that can be applied to each group. Our method removes the inter-frequency deflection while the network learns a frequency-aware characteristic. In the experiments with audio data, we observed that SSN can efficiently improve the network's performance.
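
The sub-band splitting described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes an input laid out as (batch, channels, frequency, time), a number of sub-bands S that divides the frequency dimension evenly, and statistics computed from the current batch only, with no running averages. The function name `subspectral_norm` and the parameters `num_groups`, `gamma`, and `beta` are hypothetical names chosen for this illustration. The optional affine parameters are taken per (channel, sub-band) pair, which is presumably what the A=Sub setting in the results table refers to.

```python
import numpy as np


def subspectral_norm(x, num_groups, gamma=None, beta=None, eps=1e-5):
    """Normalize each frequency sub-band of x separately.

    x          : array of shape (N, C, F, T)
    num_groups : number of sub-bands S; must divide F
    gamma/beta : optional affine parameters of shape (C * S,)
    """
    n, c, f, t = x.shape
    if f % num_groups != 0:
        raise ValueError("frequency dimension must be divisible by num_groups")

    # Fold each contiguous sub-band into the channel axis:
    # (N, C, F, T) -> (N, C*S, F/S, T)
    xg = x.reshape(n, c * num_groups, f // num_groups, t)

    # Batch-norm style statistics, one mean/variance per (channel, sub-band).
    mean = xg.mean(axis=(0, 2, 3), keepdims=True)
    var = xg.var(axis=(0, 2, 3), keepdims=True)
    y = (xg - mean) / np.sqrt(var + eps)

    # Optional affine transformation applied per (channel, sub-band).
    if gamma is not None:
        y = y * np.asarray(gamma).reshape(1, c * num_groups, 1, 1)
    if beta is not None:
        y = y + np.asarray(beta).reshape(1, c * num_groups, 1, 1)

    # Restore the original layout.
    return y.reshape(n, c, f, t)


# Usage: 8 frequency bins split into 4 sub-bands of 2 bins each.
rng = np.random.default_rng(0)
offset = 10.0 * np.arange(8).reshape(1, 1, 8, 1)  # frequency-dependent bias
x = rng.normal(size=(2, 3, 8, 5)) + offset
y = subspectral_norm(x, num_groups=4)
```

Because the reshape keeps neighbouring frequency bins together, each sub-band ends up with its own mean and variance, so a bias that varies across frequency is removed band by band rather than averaged over the whole spectrum.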

Results from the Paper


Ranked #1 on Keyword Spotting on Google Speech Commands (% Test Accuracy metric)

Task              Dataset                         Model                                Metric Name      Metric Value  Global Rank
Keyword Spotting  Google Speech Commands          res8 w/ SSN(S=4, A=Sub)              % Test Accuracy  95.4% ±0.22   # 1
Keyword Spotting  Google Speech Commands          res15 w/ SSN(S=4, A=Sub)             % Test Accuracy  96.8% ±0.13   # 2
Keyword Spotting  Google Speech Commands          res15 w/ SSN(S=4, A=Sub) (2019)      % Test Accuracy  97.5% ±0.15   # 3
Keyword Spotting  TAU Urban Acoustic Scenes 2019  CP-ResNet(ch64) w/ SSN(S=2, A=Sub)   Accuracy         83.6% ±0.07   # 1
Keyword Spotting  TAU Urban Acoustic Scenes 2019  CP-ResNet(ch128) w/ SSN(S=2, A=Sub)  Accuracy         84.1% ±0.20   # 2

Methods