MetaAudio: A Few-Shot Audio Classification Benchmark

5 Apr 2022 · Calum Heggan, Sam Budgett, Timothy Hospedales, Mehrdad Yaghoobi

Currently available benchmarks for few-shot learning (machine learning with few training examples) are limited in the domains they cover, primarily focusing on image classification. This work aims to alleviate this reliance on image-based benchmarks by offering the first comprehensive, public and fully reproducible audio-based alternative, covering a variety of sound domains and experimental settings. We compare the few-shot classification performance of a variety of techniques on seven audio datasets (ranging from environmental sounds to human speech). Extending this, we carry out in-depth analyses of joint training (where all datasets are used during training) and cross-dataset adaptation protocols, establishing the possibility of a generalised audio few-shot classification algorithm. Our experiments show that gradient-based meta-learning methods such as MAML and Meta-Curvature consistently outperform both metric and baseline methods. We also demonstrate that the joint training routine helps overall generalisation for the environmental sound datasets included, and that it is a somewhat effective method of tackling the cross-dataset/domain setting.
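To make the N-way K-shot evaluation setting concrete, here is a minimal sketch of how a single 5-way 1-shot episode might be drawn and scored with a nearest-centroid rule (the idea behind metric methods such as Prototypical Networks). It is an illustration only, not the benchmark's own code: the function names `sample_episode` and `nearest_centroid_accuracy`, the toy 2-D features and the choice of two query clips per class are all assumptions made for this example.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way=5, k_shot=1, n_query=2, rng=None):
    """Return (support, query) lists of (index, class) pairs for one episode."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Only classes with enough clips for both support and query are eligible.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    classes = rng.sample(sorted(eligible), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, c) for i in picked[:k_shot]]
        query += [(i, c) for i in picked[k_shot:]]
    return support, query


def nearest_centroid_accuracy(features, support, query):
    """Classify each query clip by its closest class centroid (squared L2)."""
    sums, counts = {}, {}
    for i, c in support:
        vec = features[i]
        if c not in sums:
            sums[c] = [0.0] * len(vec)
            counts[c] = 0
        sums[c] = [s + v for s, v in zip(sums[c], vec)]
        counts[c] += 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    correct = 0
    for i, c in query:
        pred = min(centroids, key=lambda p: sq_dist(features[i], centroids[p]))
        correct += pred == c
    return correct / len(query)


# Toy data (hypothetical): 8 classes, 4 clips each, well-separated 2-D features.
labels = [c for c in range(8) for _ in range(4)]
features = [[10.0 * c + 0.1 * j, 0.1 * j] for c in range(8) for j in range(4)]

support, query = sample_episode(labels, rng=random.Random(0))
acc = nearest_centroid_accuracy(features, support, query)
```

In a real evaluation the features would come from a trained encoder rather than hand-made vectors, and accuracy would be averaged over many such episodes.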

Results (Task: Few-Shot Audio Classification; Metric: Top-1 Accuracy, 5-Way-1-Shot)

| Dataset | Model | Top-1 Accuracy (5-Way-1-Shot) | Global Rank |
|---|---|---|---|
| BirdClef 2020 (Pruned) | Prototypical Networks (CRNN) | 56.11 ± 0.46 | 5 |
| BirdClef 2020 (Pruned) | MAML (CRNN) | 56.26 ± 0.45 | 4 |
| BirdClef 2020 (Pruned) | SimpleShot CL2N (AST ImageNet & AudioSet - No fine-tune) | 36.41 ± 0.42 | 6 |
| BirdClef 2020 (Pruned) | SimpleShot CL2N (AST ImageNet - No fine-tune) | 33.04 ± 0.41 | 7 |
| BirdClef 2020 (Pruned) | Meta-Baseline (CRNN) | 57.28 ± 0.41 | 3 |
| BirdClef 2020 (Pruned) | SimpleShot CL2N (CRNN) | 57.66 ± 0.43 | 2 |
| BirdClef 2020 (Pruned) | Meta-Curvature (CRNN) | 61.34 ± 0.46 | 1 |
| ESC-50 | Prototypical Networks (CRNN) | 68.83 ± 0.38 | 5 |
| ESC-50 | Meta-Baseline (CRNN) | 71.72 ± 0.38 | 3 |
| ESC-50 | Meta-Curvature (CRNN) | 76.17 ± 0.41 | 1 |
| ESC-50 | MAML (CRNN) | 74.66 ± 0.42 | 2 |
| ESC-50 | SimpleShot CL2N (AST ImageNet - No fine-tune) | 60.41 ± 0.41 | 9 |
| ESC-50 | SimpleShot CL2N (AST ImageNet & AudioSet - No fine-tune) | 64.48 ± 0.41 | 7 |
| ESC-50 | SimpleShot CL2N (CRNN) | 68.82 ± 0.39 | 6 |
| FSDKaggle2018 | SimpleShot CL2N (AST ImageNet & AudioSet - No fine-tune) | 38.78 ± 0.41 | 7 |
| FSDKaggle2018 | SimpleShot CL2N (AST ImageNet - No fine-tune) | 33.52 ± 0.39 | 9 |
| FSDKaggle2018 | SimpleShot CL2N (CRNN) | 42.05 ± 0.42 | 3 |
| FSDKaggle2018 | Meta-Baseline (CRNN) | 40.27 ± 0.44 | 4 |
| FSDKaggle2018 | MAML (CRNN) | 43.45 ± 0.46 | 1 |
| FSDKaggle2018 | Meta-Curvature (CRNN) | 43.18 ± 0.45 | 2 |
| FSDKaggle2018 | Prototypical Networks (CRNN) | 39.44 ± 0.44 | 5 |
| NSynth | MAML (CRNN) | 93.85 ± 0.24 | 3 |
| NSynth | Meta-Curvature (CRNN) | 96.47 ± 0.19 | 1 |
| NSynth | SimpleShot CL2N Classifier (AST ImageNet & AudioSet - No fine-tune) | 63.78 ± 0.42 | 9 |
| NSynth | SimpleShot CL2N Classifier (AST pre-trained w/ ImageNet - No fine-tune) | 66.68 ± 0.41 | 7 |
| NSynth | Meta-Baseline (CRNN) | 90.74 ± 0.25 | 4 |
| NSynth | SimpleShot CL2N (CRNN) | 90.04 ± 0.27 | 5 |
| NSynth | Prototypical Networks (CRNN) | 95.23 ± 0.19 | 2 |
| VoxCeleb1 | SimpleShot CL2N (CRNN) | 48.50 ± 0.42 | 5 |
| VoxCeleb1 | Meta-Baseline (CRNN) | 55.54 ± 0.42 | 4 |
| VoxCeleb1 | SimpleShot CL2N (AST ImageNet - No fine-tune) | 28.09 ± 0.37 | 9 |
| VoxCeleb1 | SimpleShot CL2N (AST ImageNet & AudioSet - No fine-tune) | 28.79 ± 0.38 | 8 |
| VoxCeleb1 | Meta-Curvature (CRNN) | 63.85 ± 0.44 | 1 |
| VoxCeleb1 | MAML (CRNN) | 60.89 ± 0.45 | 2 |
| VoxCeleb1 | Prototypical Networks (CRNN) | 59.64 ± 0.44 | 3 |
| Watkins Marine Mammal Sounds | SimpleShot CL2N (AST ImageNet & AudioSet - No fine-tune) | 51.81 ± 0.42 | 4 |
| Watkins Marine Mammal Sounds | SimpleShot CL2N (AST ImageNet - No fine-tune) | 55.40 ± 0.42 | 2 |
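
Each result above is reported as a mean accuracy with a ± term. This page does not state what the ± term represents; a common convention in few-shot evaluation is a 95% confidence interval over many test episodes, and the sketch below assumes that convention. The function name `summarise_episodes` and the per-episode accuracies are hypothetical and serve only to show the arithmetic.

```python
import math
import statistics


def summarise_episodes(accuracies, z=1.96):
    """Return (mean, half_width) in percent for per-episode accuracies in [0, 1]."""
    mean = statistics.fmean(accuracies)
    # Standard error of the mean; z=1.96 gives a normal-approximation
    # 95% interval (an assumption, not something stated on this page).
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return 100 * mean, 100 * z * sem


# Hypothetical accuracies from five test episodes (real runs use far more).
episode_acc = [0.4, 0.6, 0.8, 0.6, 0.6]
mean, half_width = summarise_episodes(episode_acc)
print(f"{mean:.2f} ± {half_width:.2f}")
```

Because the half-width shrinks with the square root of the number of episodes, intervals as tight as those in the table would require a large number of evaluation episodes.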

Methods