A Universal Learnable Audio Frontend

Mel-filterbanks are fixed, engineered audio features that emulate human perception and have been used throughout the history of audio understanding, up to the present day. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single, universal learnable frontend that outperforms mel-filterbanks over a wide range of audio domains, including speech, music, audio events, and animal sounds, providing an unprecedented general-purpose learned frontend for audio. To do so, we introduce a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement for mel-filterbanks. Our system learns all operations of audio feature extraction, from filtering to pooling, compression, and normalization, and can be integrated into any neural network at a negligible parameter cost. We perform multi-task training on eight diverse audio classification tasks, and show consistent improvements of our model over mel-filterbanks and previous learnable alternatives. Moreover, our system is competitive with the current state-of-the-art learnable frontend on AudioSet, with orders of magnitude fewer parameters.
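To make the described pipeline concrete, below is a minimal, hypothetical sketch of such a fully learnable frontend in PyTorch, with filtering, pooling, compression, and normalization all expressed as trainable modules. This is not the paper's implementation; the class, parameter choices, and default sizes are illustrative assumptions only.

```python
# Hypothetical sketch of a fully learnable audio frontend: each stage of
# classic mel-filterbank extraction is replaced by a trainable counterpart.
import torch
import torch.nn as nn


class LearnableFrontend(nn.Module):
    def __init__(self, n_filters=40, filter_len=401, stride=160):
        super().__init__()
        # Learnable bandpass filtering, replacing the fixed mel filterbank.
        self.filters = nn.Conv1d(1, n_filters, filter_len,
                                 padding=filter_len // 2, bias=False)
        # Learnable per-channel lowpass pooling with decimation (depthwise
        # conv + stride), replacing fixed framing and magnitude pooling.
        self.pooling = nn.Conv1d(n_filters, n_filters, filter_len,
                                 stride=stride, padding=filter_len // 2,
                                 groups=n_filters, bias=False)
        # Learnable per-channel compression exponent, replacing the fixed
        # log compression of log-mel features.
        self.exponent = nn.Parameter(torch.full((n_filters, 1), 0.3))
        # Learnable per-channel normalization of the output features.
        self.norm = nn.InstanceNorm1d(n_filters, affine=True)

    def forward(self, waveform):
        # waveform: (batch, samples) raw audio
        x = self.filters(waveform.unsqueeze(1))  # (batch, filters, time)
        x = x.pow(2)                             # energy (squared response)
        x = self.pooling(x).clamp(min=1e-6)      # learned lowpass + decimation
        x = x.pow(self.exponent)                 # learned compression
        return self.norm(x)                      # learned normalization


frontend = LearnableFrontend()
features = frontend(torch.randn(2, 16000))  # 1 s of 16 kHz audio
print(features.shape)                       # torch.Size([2, 40, 100])
```

Because the output is a (batch, channels, frames) time-frequency map of the same shape a mel-filterbank would produce, the module can be prepended to an existing classifier as a drop-in replacement, with only a few thousand extra parameters.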
