Audio Classification
133 papers with code • 20 benchmarks • 35 datasets
Audio Classification is a machine learning task that involves identifying and tagging audio signals into different classes or categories. The goal of audio classification is to enable machines to automatically recognize and distinguish between different types of audio, such as music, speech, and environmental sounds.
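As a toy illustration of the task (not any particular paper's method), the sketch below classifies a one-second clip as a pure tone or broadband noise using a single hand-crafted feature, the spectral centroid. The threshold and class names are illustrative assumptions; real systems learn features from labeled data.

```python
import numpy as np

SR = 16_000  # assumed sample rate (Hz)

def spectral_centroid(signal, sr=SR):
    """Magnitude-weighted mean frequency of the signal's spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))

def classify(signal, sr=SR, threshold=2000.0):
    """Toy rule: a low spectral centroid suggests a tone, a high one noise."""
    return "tone" if spectral_centroid(signal, sr) < threshold else "noise"

t = np.arange(SR) / SR
tone = np.sin(2 * np.pi * 440.0 * t)        # 440 Hz sine wave
noise = np.random.default_rng(0).standard_normal(SR)  # white noise

print(classify(tone))
print(classify(noise))
```

A 440 Hz sine concentrates its energy near 440 Hz, while white noise spreads energy across the whole spectrum, so a single frequency statistic separates these two classes; distinguishing speech, music, and environmental sounds requires learned representations like those in the papers below.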
Most implemented papers
Augmenting Deep Classifiers with Polynomial Neural Networks
The efficacy of the proposed models is evaluated on standard image and audio classification benchmarks.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
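A multimodal contrastive loss of the kind VATT trains with can be sketched as a symmetric InfoNCE objective over paired embeddings. This is a generic numpy illustration, not VATT's exact formulation; the temperature value and function names are assumptions.

```python
import numpy as np

def cross_entropy(logits):
    """Softmax cross-entropy where row i's target is class i (the matching pair)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

def multimodal_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE: matching (video, audio) rows are positives,
    every other pairing in the batch serves as a negative."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (B, B) cosine-similarity matrix
    return float((cross_entropy(logits) + cross_entropy(logits.T)) / 2)
```

Aligned pairs drive the diagonal of the similarity matrix up relative to the off-diagonal entries, so the loss is small when the two modalities' embeddings agree and large when they are mismatched.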
Efficient Training of Audio Transformers with Patchout
However, one of the main shortcomings of transformer models compared to well-established CNNs is their computational complexity.
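The core idea of Patchout is to randomly drop input patches during training, which shortens the transformer's input sequence and so reduces the quadratic attention cost. A minimal sketch of that sampling step (function name and drop fraction are illustrative):

```python
import numpy as np

def patchout(patches, drop_frac=0.5, rng=None):
    """Randomly drop a fraction of spectrogram patch embeddings.

    patches: array of shape (num_patches, embed_dim).
    Returns the kept subset in original order; the shorter sequence is
    what the transformer encoder sees during training.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(patches)
    keep = rng.choice(n, size=int(n * (1 - drop_frac)), replace=False)
    keep.sort()  # preserve the patches' temporal/spectral order
    return patches[keep]
```

Since self-attention scales quadratically with sequence length, halving the number of patches roughly quarters the attention FLOPs for that forward pass.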
SSAST: Self-Supervised Audio Spectrogram Transformer
However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST.
CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification
Audio classification is an active research area with a wide range of applications.
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification.
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation
We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of 0.483 mAP on AudioSet.
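Knowledge distillation of this kind trains the student to match the teacher's temperature-softened output distribution. Below is the standard KD objective as a generic numpy sketch (the AudioSet setup is multi-label and uses its own loss details, so treat hyperparameters and names here as assumptions):

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the
    softened student distribution, averaged over the batch."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean())
```

A higher temperature flattens both distributions, exposing the teacher's relative confidences across wrong classes ("dark knowledge") rather than only its top prediction.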
Audiovisual Masked Autoencoders
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning?
BEATs: Audio Pre-Training with Acoustic Tokenizers
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
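A random-projection acoustic tokenizer of this flavor can be sketched as a frozen random matrix that projects each feature frame, followed by a nearest-neighbor lookup in a frozen random codebook; the resulting indices serve as prediction labels for masked pretraining. This is an illustrative sketch, not BEATs' exact tokenizer; all names and dimensions are assumptions.

```python
import numpy as np

def make_tokenizer(feat_dim, code_dim, vocab_size, seed=0):
    """Build a frozen random-projection tokenizer.

    Each input frame is projected by a fixed random matrix, then assigned
    the index of the most similar entry in a fixed random codebook.
    """
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((feat_dim, code_dim))       # frozen projection
    codebook = rng.standard_normal((vocab_size, code_dim)) # frozen codes
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

    def tokenize(frames):
        z = frames @ proj
        z /= np.linalg.norm(z, axis=1, keepdims=True) + 1e-12
        return np.argmax(z @ codebook.T, axis=1)  # nearest code by cosine

    return tokenize
```

Because the projection and codebook are fixed, the tokenizer needs no training itself; it simply provides stable discrete targets that the SSL model learns to predict from masked input.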
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way for building a general representation model toward unlimited modalities.