Audio Classification
131 papers with code • 23 benchmarks • 34 datasets
Audio Classification is a machine learning task that involves identifying audio signals and tagging them with different classes or categories. The goal of audio classification is to enable machines to automatically recognize and distinguish between different types of audio, such as music, speech, and environmental sounds.
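As a toy illustration of the idea (a generic sketch, not any specific system from the papers below), the pipeline can be reduced to "extract a feature, then assign a label." Here the feature is simply the dominant frequency of the signal, and the labels and threshold are illustrative assumptions:

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) with the most spectral energy — a toy audio feature."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

def classify(signal, sample_rate, threshold_hz=1000.0):
    """Toy classifier: label by whether the dominant frequency is below the threshold."""
    if dominant_frequency(signal, sample_rate) < threshold_hz:
        return "speech-like"
    return "noise-like"

# Synthetic examples: a 200 Hz tone vs. a 3 kHz tone, 1 second at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 200 * t)    # classified "speech-like"
high = np.sin(2 * np.pi * 3000 * t)  # classified "noise-like"
```

Real systems replace the hand-crafted feature with learned representations (e.g., spectrogram Transformers, as in several papers below), but the input-to-label structure is the same.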
Latest papers
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
However, existing benchmarks predate the popularization of large multi-modal models, such as CLIP and CLAP.
nEMO: Dataset of Emotional Speech in Polish
Speech emotion recognition has become increasingly important in recent years due to its potential applications in healthcare, customer service, and personalization of dialogue systems.
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio
APNet allows prototypes to be reconstructed to waveforms for interpretability, relying on the nearest training data samples.
Learning Audio Concepts from Counterfactual Natural Language
Conventional audio classification relied on predefined classes, lacking the ability to learn from free-form text.
Stethoscope-guided Supervised Contrastive Learning for Cross-domain Adaptation on Respiratory Sound Classification
Despite the remarkable advances in deep learning technology, achieving satisfactory performance in lung sound classification remains a challenge due to the scarcity of available data.
Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers
The common modus operandi of fine-tuning large pre-trained Transformer models entails adapting all of their parameters (i.e., full fine-tuning).
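The contrast between full fine-tuning and parameter-efficient transfer can be sketched in a framework-agnostic way (a generic illustration with made-up layer names and shapes, not the method of the paper above): full fine-tuning treats every backbone weight as trainable, while an adapter-style approach freezes the backbone and trains only a small inserted module:

```python
import numpy as np

# Toy "pre-trained Transformer block": a dict of weight matrices (shapes are arbitrary,
# loosely modeled on a 768-dim hidden size).
backbone = {
    "attn.qkv": np.zeros((768, 2304)),
    "attn.out": np.zeros((768, 768)),
    "mlp.fc1":  np.zeros((768, 3072)),
    "mlp.fc2":  np.zeros((3072, 768)),
}

# Small adapter inserted for parameter-efficient tuning (bottleneck dim 64).
adapter = {
    "adapter.down": np.zeros((768, 64)),
    "adapter.up":   np.zeros((64, 768)),
}

def n_params(params):
    """Count scalar parameters in a dict of weight arrays."""
    return sum(w.size for w in params.values())

full_finetune = n_params(backbone) + n_params(adapter)  # everything trainable
adapter_only = n_params(adapter)                        # backbone frozen
```

Even in this toy setting the adapter holds under 2% of the parameters, which is the kind of gap that motivates parameter-efficient transfer learning for large audio spectrogram Transformers.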
Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities
Moreover, we improve the audio language model framework by using interleaved audio-text embeddings as the input sequence.
Investigating the Emergent Audio Classification Ability of ASR Foundation Models
Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings.
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Recently, instruction-following audio-language models have received broad attention for audio interaction with humans.