Audio Classification
131 papers with code • 23 benchmarks • 34 datasets
Audio Classification is a machine learning task that involves identifying audio signals and tagging them with different classes or categories. The goal of audio classification is to enable machines to automatically recognize and distinguish between different types of audio, such as music, speech, and environmental sounds.
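As a toy illustration of the idea (a generic sketch, not any specific system from the papers below), the pipeline can be reduced to "extract a feature, then assign a label." Here the feature is simply the dominant frequency of the signal, and the labels and threshold are illustrative assumptions:

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) with the most spectral energy — a toy audio feature."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

def classify(signal, sample_rate, threshold_hz=1000.0):
    """Toy classifier: label by whether the dominant frequency is below the threshold."""
    if dominant_frequency(signal, sample_rate) < threshold_hz:
        return "speech-like"
    return "noise-like"

# Synthetic examples: a 200 Hz tone vs. a 3 kHz tone, 1 second at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 200 * t)    # classified "speech-like"
high = np.sin(2 * np.pi * 3000 * t)  # classified "noise-like"
```

Real systems replace the hand-crafted feature with learned representations (e.g., spectrogram Transformers, as in several papers below), but the input-to-label structure is the same.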
Latest papers
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
However, existing benchmarks predate the popularization of large multi-modal models, such as CLIP and CLAP.
nEMO: Dataset of Emotional Speech in Polish
Speech emotion recognition has become increasingly important in recent years due to its potential applications in healthcare, customer service, and personalization of dialogue systems.
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio
APNet allows prototypes to be reconstructed to waveforms for interpretability, relying on the nearest training data samples.
Learning Audio Concepts from Counterfactual Natural Language
Conventional audio classification relied on predefined classes, lacking the ability to learn from free-form text.
Stethoscope-guided Supervised Contrastive Learning for Cross-domain Adaptation on Respiratory Sound Classification
Despite the remarkable advances in deep learning technology, achieving satisfactory performance in lung sound classification remains a challenge due to the scarcity of available data.
Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers
The common modus operandi of fine-tuning large pre-trained Transformer models entails adapting all of their parameters (i.e., full fine-tuning).
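The contrast between full fine-tuning and parameter-efficient transfer can be sketched in a framework-agnostic way (a generic illustration with made-up layer names and shapes, not the method of the paper above): full fine-tuning treats every backbone weight as trainable, while an adapter-style approach freezes the backbone and trains only a small inserted module:

```python
import numpy as np

# Toy "pre-trained Transformer block": a dict of weight matrices (shapes are arbitrary,
# loosely modeled on a 768-dim hidden size).
backbone = {
    "attn.qkv": np.zeros((768, 2304)),
    "attn.out": np.zeros((768, 768)),
    "mlp.fc1":  np.zeros((768, 3072)),
    "mlp.fc2":  np.zeros((3072, 768)),
}

# Small adapter inserted for parameter-efficient tuning (bottleneck dim 64).
adapter = {
    "adapter.down": np.zeros((768, 64)),
    "adapter.up":   np.zeros((64, 768)),
}

def n_params(params):
    """Count scalar parameters in a dict of weight arrays."""
    return sum(w.size for w in params.values())

full_finetune = n_params(backbone) + n_params(adapter)  # everything trainable
adapter_only = n_params(adapter)                        # backbone frozen
```

Even in this toy setting the adapter holds under 2% of the parameters, which is the kind of gap that motivates parameter-efficient transfer learning for large audio spectrogram Transformers.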
Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities
Moreover, we improve the audio language model framework by using interleaved audio-text embeddings as the input sequence.
Investigating the Emergent Audio Classification Ability of ASR Foundation Models
Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings.
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Recently, instruction-following audio-language models have received broad attention for audio interaction with humans.