A Universal Learnable Audio Frontend

Mel-filterbanks are fixed, engineered audio features that emulate human perception and have been used throughout the history of audio understanding, up to the present day. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single, universal learnable frontend that outperforms mel-filterbanks over a wide range of audio domains, including speech, music, audio events, and animal sounds, providing an unprecedented general-purpose learned frontend for audio. To do so, we introduce a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement for mel-filterbanks. Our system learns all operations of audio feature extraction, from filtering to pooling, compression, and normalization, and can be integrated into any neural network at a negligible parameter cost. We perform multi-task training on eight diverse audio classification tasks, and show consistent improvements of our model over mel-filterbanks and previous learnable alternatives. Moreover, our system is competitive with the current state-of-the-art learnable frontend on AudioSet, with orders of magnitude fewer parameters.
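To make the described pipeline concrete, below is a minimal, hypothetical sketch of such a fully learnable frontend in PyTorch, with filtering, pooling, compression, and normalization all expressed as trainable modules. This is not the paper's implementation; the class, parameter choices, and default sizes are illustrative assumptions only.

```python
# Hypothetical sketch of a fully learnable audio frontend: each stage of
# classic mel-filterbank extraction is replaced by a trainable counterpart.
import torch
import torch.nn as nn


class LearnableFrontend(nn.Module):
    def __init__(self, n_filters=40, filter_len=401, stride=160):
        super().__init__()
        # Learnable bandpass filtering, replacing the fixed mel filterbank.
        self.filters = nn.Conv1d(1, n_filters, filter_len,
                                 padding=filter_len // 2, bias=False)
        # Learnable per-channel lowpass pooling with decimation (depthwise
        # conv + stride), replacing fixed framing and magnitude pooling.
        self.pooling = nn.Conv1d(n_filters, n_filters, filter_len,
                                 stride=stride, padding=filter_len // 2,
                                 groups=n_filters, bias=False)
        # Learnable per-channel compression exponent, replacing the fixed
        # log compression of log-mel features.
        self.exponent = nn.Parameter(torch.full((n_filters, 1), 0.3))
        # Learnable per-channel normalization of the output features.
        self.norm = nn.InstanceNorm1d(n_filters, affine=True)

    def forward(self, waveform):
        # waveform: (batch, samples) raw audio
        x = self.filters(waveform.unsqueeze(1))  # (batch, filters, time)
        x = x.pow(2)                             # energy (squared response)
        x = self.pooling(x).clamp(min=1e-6)      # learned lowpass + decimation
        x = x.pow(self.exponent)                 # learned compression
        return self.norm(x)                      # learned normalization


frontend = LearnableFrontend()
features = frontend(torch.randn(2, 16000))  # 1 s of 16 kHz audio
print(features.shape)                       # torch.Size([2, 40, 100])
```

Because the output is a (batch, channels, frames) time-frequency map of the same shape a mel-filterbank would produce, the module can be prepended to an existing classifier as a drop-in replacement, with only a few thousand extra parameters.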
