Zero-shot Audio Classification

9 papers with code • 4 benchmarks • 4 datasets

Zero-shot audio classification assigns a label to an audio clip without any task-specific training examples, typically by embedding the audio and a set of candidate label prompts into a shared audio-language space and selecting the label whose text embedding is closest to the audio embedding.
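
A rough sketch of that matching step is below. The encoder functions are stand-ins for a real pretrained audio-language model such as CLAP (random unit vectors keep the snippet self-contained and runnable); the prompt template and file name are illustrative.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for pretrained CLAP-style encoders: in practice these
# would come from the model; random unit vectors keep the sketch runnable.
def encode_audio(paths):
    return F.normalize(torch.randn(len(paths), 512), dim=-1)

def encode_text(prompts):
    return F.normalize(torch.randn(len(prompts), 512), dim=-1)

class_names = ["dog barking", "rain", "siren"]
prompts = [f"the sound of {c}" for c in class_names]

audio_emb = encode_audio(["clip.wav"])   # (1, 512), unit-norm
text_emb = encode_text(prompts)          # (3, 512), unit-norm
similarity = audio_emb @ text_emb.T      # cosine similarity between audio and each label
pred = class_names[similarity.argmax(dim=-1).item()]
print(pred)
```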

Most implemented papers

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

pku-yuangroup/languagebind 3 Oct 2023

We thus propose VIDAL-10M, a dataset of Video, Infrared, Depth, and Audio paired with their corresponding Language.

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

xinhaomei/wavcaps 30 Mar 2023

To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.

ImageBind: One Embedding Space To Bind Them All

facebookresearch/imagebind CVPR 2023

We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together.
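
The repository's README demonstrates this emergent cross-modal transfer directly; a condensed version of that usage for audio-vs-text scoring is below (the audio file path is a placeholder).

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).to(device).eval()

text_list = ["a dog barking", "rain falling", "a car engine"]
audio_paths = ["dog_audio.wav"]  # placeholder path

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}
with torch.no_grad():
    embeddings = model(inputs)

# Audio-text similarity emerges even though training only used image-paired data.
probs = torch.softmax(
    embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1
)
print(probs)
```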

Sound-Guided Semantic Image Manipulation

kuai-lab/sound-guided-semantic-image-manipulation CVPR 2022

Our audio encoder is trained to produce a latent representation from an audio input, which is forced to be aligned with image and text representations in the multi-modal embedding space.
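
A minimal sketch of this kind of alignment objective, assuming a symmetric InfoNCE (CLIP-style contrastive) loss between a trainable audio encoder and frozen targets in the image-text embedding space; names, shapes, and the toy batch are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(audio_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th audio clip should match the i-th target."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)
    logits = audio_emb @ target_emb.T / temperature
    labels = torch.arange(len(audio_emb))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# Toy batch: 8 audio clips paired with frozen image embeddings from the
# multi-modal space the audio encoder is being pulled into.
audio_emb = torch.randn(8, 512, requires_grad=True)  # from the trainable audio encoder
image_emb = torch.randn(8, 512)                      # from a frozen image/text encoder
loss = clip_style_contrastive_loss(audio_emb, image_emb)
loss.backward()
```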

Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

zhaoyanpeng/vipant NAACL 2022

In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1.

Exploring Meta Information for Audio-based Zero-shot Bird Classification

atriantafyllopoulos/audiocub-zsl 15 Sep 2023

Advances in passive acoustic monitoring and machine learning have led to the procurement of vast datasets for computational bioacoustic research.

Investigating the Emergent Audio Classification Ability of ASR Foundation Models

julirao/whisper_audio_classification 15 Nov 2023

Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings.
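
One plausible way to turn an ASR model into a zero-shot audio classifier, sketched below with Hugging Face's Whisper implementation, is to score each candidate label by the likelihood the decoder assigns to it as a transcript; this scoring scheme is an illustrative assumption, not necessarily the paper's exact protocol.

```python
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base").eval()

def classify(waveform, class_names, sampling_rate=16000):
    # Whisper expects 16 kHz mono audio.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    scores = []
    for name in class_names:
        labels = processor.tokenizer(name, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(input_features=inputs.input_features, labels=labels)
        scores.append(-out.loss.item())  # higher (less negative) = more likely
    return class_names[scores.index(max(scores))]

# Dummy one-second clip of silence just to make the sketch executable.
print(classify(np.zeros(16000, dtype=np.float32), ["dog barking", "rain", "siren"]))
```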

Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs

Camille112/T-CLAP 17 Aug 2024

In this paper, we propose to equip multi-modal ALMs with temporal understanding, without losing their inherent prior capabilities on audio-language tasks, via a temporal instillation method, TeminAL.

ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds

sreyan88/reclap 13 Sep 2024

To achieve this, we first propose ReCLAP, a CLAP model trained with rewritten audio captions for improved understanding of sounds in the wild.
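
An illustrative sketch of the prompting idea follows: each class name is expanded into captions that describe how the sound is actually perceived, and the class embedding is the average of those caption embeddings. The descriptive captions are made up for illustration, and stub encoders stand in for a real CLAP model so the snippet stays runnable.

```python
import torch
import torch.nn.functional as F

# Stand-ins for CLAP text/audio encoders; random unit vectors keep this runnable.
def encode_text(prompts):
    return F.normalize(torch.randn(len(prompts), 512), dim=-1)

def encode_audio(paths):
    return F.normalize(torch.randn(len(paths), 512), dim=-1)

# Illustrative descriptive prompts (not the paper's exact rewrites): each class
# is described by how the sound is heard, rather than by its bare label.
prompts = {
    "dog bark": ["a dog barking: sharp, repetitive vocal bursts"],
    "rain": ["rain falling: a soft, continuous patter of droplets"],
}

class_emb = torch.stack([encode_text(p).mean(dim=0) for p in prompts.values()])
class_emb = F.normalize(class_emb, dim=-1)
audio_emb = encode_audio(["clip.wav"])
pred = list(prompts)[(audio_emb @ class_emb.T).argmax().item()]
print(pred)
```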