TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Environmental Sound Classification	ESC-50	AudioCLIP	Accuracy	97.15	# 1
Zero-Shot Environment Sound Classification	ESC-50	AudioCLIP (partial training)	Accuracy	69.40	# 4
Environmental Sound Classification	UrbanSound8K	AudioCLIP	Accuracy	90.07	# 5

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/audioclip-extending-clip-to-image-text-and/environmental-sound-classification-on-esc-50)](https://paperswithcode.com/sota/environmental-sound-classification-on-esc-50?p=audioclip-extending-clip-to-image-text-and)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/audioclip-extending-clip-to-image-text-and/zero-shot-environment-sound-classification-on-1)](https://paperswithcode.com/sota/zero-shot-environment-sound-classification-on-1?p=audioclip-extending-clip-to-image-text-and)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/audioclip-extending-clip-to-image-text-and/environmental-sound-classification-on)](https://paperswithcode.com/sota/environmental-sound-classification-on?p=audioclip-extending-clip-to-image-text-and)`
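The three badge snippets above share one URL pattern: a shields.io endpoint wrapping a Papers with Code badge path, linked to the matching leaderboard page. The sketch below rebuilds that pattern from the two slugs visible in the URLs. The function name `build_pwc_badge` and its parameter names are hypothetical, chosen for illustration only; the pattern is inferred from the rows above rather than taken from any documented API.

```python
def build_pwc_badge(paper_slug, benchmark_slug):
    """Assemble badge Markdown in the same shape as the rows above.

    Both arguments are slugs copied from an existing badge URL; this
    helper is an illustrative assumption, not an official tool.
    """
    # Image URL: shields.io endpoint pointing at the Papers with Code badge.
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    # Link target: the leaderboard page, with the paper passed as `p`.
    target_url = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({target_url})"


paper = "audioclip-extending-clip-to-image-text-and"
esc50_badge = build_pwc_badge(paper, "environmental-sound-classification-on-esc-50")
print(esc50_badge)
```

Calling it with the ESC-50 benchmark slug reproduces the first badge row character for character, which is a quick way to check the pattern before pasting a badge into a README.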

AudioCLIP: Extending CLIP to Image, Text and Audio

24 Jun 2021 · Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Further, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively). Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
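To make the zero-shot inference idea in the abstract concrete, here is a minimal sketch of CLIP-style zero-shot classification: an audio embedding is compared against text embeddings of candidate labels by cosine similarity, and a softmax turns the scaled similarities into probabilities. This is not the AudioCLIP code or API. The function names, the `logit_scale` value, and the three-dimensional toy embeddings are all assumptions made up for illustration; in a real setup the embeddings would come from trained audio and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(audio_embedding, label_embeddings, logit_scale=10.0):
    """Pick the label whose text embedding is closest to the audio embedding.

    `label_embeddings` maps label text to an embedding vector.
    `logit_scale` is a hypothetical temperature, not a value from the paper.
    """
    audio = l2_normalize(audio_embedding)
    logits = {}
    for label, embedding in label_embeddings.items():
        text = l2_normalize(embedding)
        logits[label] = logit_scale * sum(a * t for a, t in zip(audio, text))

    # Softmax over labels; subtracting the max keeps exp() numerically stable.
    top = max(logits.values())
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    probabilities = {label: value / total for label, value in exps.items()}
    best_label = max(probabilities, key=probabilities.get)
    return best_label, probabilities


# Toy embeddings, invented for this example only.
toy_label_embeddings = {
    "dog bark": [1.0, 0.0, 0.0],
    "rain": [0.0, 1.0, 0.0],
    "siren": [0.0, 0.0, 1.0],
}
toy_audio_embedding = [0.9, 0.1, 0.2]

predicted, probabilities = zero_shot_classify(toy_audio_embedding, toy_label_embeddings)
print(predicted, probabilities)
```

Because the labels enter only as text embeddings, new classes can be added at inference time without retraining, which is the property the abstract refers to as generalizing to unseen datasets.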

Code

Repository	Stars
AndreyGuzhov/AudioCLIP (official)	707
iver56/audiomentations	1,688
asteroid-team/torch-audiomentations	876
julirao/whisper_audio_classification	

Tasks

Classification

Environmental Sound Classification

Sound Classification

Zero-Shot Environment Sound Classification

Datasets

AudioSet

ESC-50

UrbanSound8K

Results from the Paper

Ranked #1 on Environmental Sound Classification on ESC-50

Methods

CLIP
