Towards Robust Domain Generalization in 2D Neural Audio Processing

29 Sep 2021 · Byeonggeun Kim, Seunghan Yang, Jangho Kim, Hyunsin Park, Jun-Tae Lee, Simyung Chang

In image processing with two-dimensional convolutional neural networks (2D-CNNs), domain information can be manipulated through channel statistics, and instance normalization has been a promising way to obtain domain-invariant features. Although 2D image features represent spatial information, 2D audio features such as the log-Mel spectrogram represent two different kinds of information: temporal and spectral. Our analysis shows that, unlike in image processing, domain-relevant information in audio features is dominant in frequency statistics rather than channel statistics. Motivated by this analysis, we introduce RFN, a plug-and-play, explicit normalization module along the frequency axis that eliminates instance-specific domain discrepancy in the audio feature while relaxing the undesirable loss of useful discriminative information. Empirically, simply adding RFN to networks yields clear margins over previous domain generalization approaches on acoustic scene classification, keyword spotting, and speaker verification, and improves robustness to differences in audio device, speaker ID, or genre.
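
To make the idea of normalizing along the frequency axis concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes features shaped (batch, channels, frequency, time) and computes statistics per instance and per frequency bin over the channel and time axes. The function names, the `lam` parameter, and the specific relaxation used here (a convex blend with the unnormalized input) are assumptions made for illustration; the abstract does not specify how RFN relaxes the normalization.

```python
import numpy as np


def frequency_instance_norm(x, eps=1e-5):
    """Normalize every frequency bin of every instance.

    x: array of shape (batch, channels, freq, time).
    Mean and variance are taken over the channel and time axes, so each
    (instance, frequency bin) slice ends up with roughly zero mean and
    unit variance.
    """
    mean = x.mean(axis=(1, 3), keepdims=True)
    var = x.var(axis=(1, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def relaxed_frequency_norm(x, lam=0.5, eps=1e-5):
    """Hypothetical relaxation: blend the input with its normalized version.

    lam = 0 applies full frequency-wise normalization; lam = 1 leaves the
    input untouched. This blend is an illustrative assumption, not a
    formulation taken from the abstract.
    """
    return lam * x + (1.0 - lam) * frequency_instance_norm(x, eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "log-Mel" batch: 2 clips, 3 channels, 40 frequency bins, 100 frames.
    feats = rng.normal(loc=7.0, scale=3.0, size=(2, 3, 40, 100))
    out = relaxed_frequency_norm(feats, lam=0.5)
    print(out.shape)
```

Because the statistics are computed per instance, the sketch needs no running averages and behaves identically at training and inference time. The blend weight controls how much of the original frequency statistics, which may still carry useful discriminative information, is retained.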
