Zero-shot Audio Classification
9 papers with code • 4 benchmarks • 4 datasets
Most implemented papers
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
We thus propose VIDAL-10M, a dataset pairing Video, Infrared, Depth, and Audio with their corresponding Language.
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
ImageBind: One Embedding Space To Bind Them All
We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together.
Sound-Guided Semantic Image Manipulation
Our audio encoder is trained to produce a latent representation from an audio input, which is forced to be aligned with image and text representations in the multi-modal embedding space.
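As a rough illustration of this kind of alignment, the sketch below shows a symmetric InfoNCE-style contrastive loss that pulls audio embeddings toward paired embeddings from a frozen image/text space. The batch size, embedding dimension, and temperature are assumptions for illustration, not the paper's exact training objective.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb: torch.Tensor,
                               target_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: each audio embedding is pulled toward its paired
    (frozen) image/text embedding and pushed away from the other pairs in the batch."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)
    logits = audio_emb @ target_emb.t() / temperature
    labels = torch.arange(audio_emb.size(0), device=audio_emb.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

# Toy usage: random tensors stand in for audio-encoder outputs and frozen
# CLIP-style image/text embeddings of the paired data.
audio = torch.randn(8, 512, requires_grad=True)
frozen_targets = torch.randn(8, 512)
loss = contrastive_alignment_loss(audio, frozen_targets)
loss.backward()
```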
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer
In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1.
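The zero-shot setting shared by these papers reduces classification to nearest-neighbour search in a joint audio-text embedding space: each class name is embedded as a text prompt and the audio clip is assigned to the closest one. The sketch below illustrates that mechanism only; encode_audio and encode_text are hypothetical stand-ins (deterministic random projections) for pretrained encoders such as a CLAP-style model, not any paper's actual implementation.

```python
import numpy as np

# Hypothetical placeholder encoders; a real system would load pretrained
# audio and text encoders that share one embedding space.
def encode_audio(waveform: np.ndarray) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(waveform.tobytes())) % 2**32)
    return rng.standard_normal(512)

def encode_text(prompt: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal(512)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(waveform: np.ndarray, class_names: list[str]) -> str:
    # Embed each class name as a text prompt, embed the clip,
    # and pick the class with the highest cosine similarity.
    prompts = [f"the sound of a {c}" for c in class_names]
    text_emb = l2_normalize(np.stack([encode_text(p) for p in prompts]))
    audio_emb = l2_normalize(encode_audio(waveform))
    scores = text_emb @ audio_emb
    return class_names[int(np.argmax(scores))]

if __name__ == "__main__":
    fake_clip = np.zeros(16000, dtype=np.float32)  # 1 s of silence at 16 kHz
    print(zero_shot_classify(fake_clip, ["dog bark", "rain", "siren"]))
```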
Exploring Meta Information for Audio-based Zero-shot Bird Classification
Advances in passive acoustic monitoring and machine learning have led to the procurement of vast datasets for computational bioacoustic research.
Investigating the Emergent Audio Classification Ability of ASR Foundation Models
Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings.
Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs
In this paper, we propose TeminAL, a temporal instillation method that equips multi-modal ALMs with temporal understanding without losing their inherent prior capabilities on audio-language tasks.
ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
To achieve this, we first propose ReCLAP, a CLAP model trained with rewritten audio captions for improved understanding of sounds in the wild.
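A common way to exploit descriptive or rewritten captions at inference time is to embed several descriptive prompts per class and average them into a single class prototype. The sketch below illustrates that idea with made-up prompts and a placeholder encode_text function; it is an assumption-laden illustration, not ReCLAP's actual prompt set or code.

```python
import numpy as np

# Hypothetical text encoder stand-in; a real setup would use a CLAP-style model.
def encode_text(prompt: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

# Several descriptive rewrites per label (illustrative wording only);
# their embeddings are averaged into one prototype per class.
DESCRIPTIVE_PROMPTS = {
    "dog bark": [
        "a dog barking sharply and repeatedly",
        "loud, rhythmic barks from a large dog",
    ],
    "rain": [
        "steady rain falling on a rooftop",
        "soft patter of raindrops on leaves",
    ],
}

def class_prototypes(prompts_by_class: dict[str, list[str]]) -> dict[str, np.ndarray]:
    protos = {}
    for label, prompts in prompts_by_class.items():
        emb = np.mean([encode_text(p) for p in prompts], axis=0)
        protos[label] = emb / np.linalg.norm(emb)
    return protos

print({label: proto.shape for label, proto in class_prototypes(DESCRIPTIVE_PROMPTS).items()})
```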