Audio Classification
133 papers with code • 20 benchmarks • 35 datasets
Audio Classification is a machine learning task that involves identifying and tagging audio signals into different classes or categories. The goal of audio classification is to enable machines to automatically recognize and distinguish between different types of audio, such as music, speech, and environmental sounds.
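As a toy illustration of the task (not any particular paper's method), the sketch below classifies a one-second clip as a pure tone or broadband noise using a single hand-crafted feature, the spectral centroid. The threshold and class names are illustrative assumptions; real systems learn features from labeled data.

```python
import numpy as np

SR = 16_000  # assumed sample rate (Hz)

def spectral_centroid(signal, sr=SR):
    """Magnitude-weighted mean frequency of the signal's spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))

def classify(signal, sr=SR, threshold=2000.0):
    """Toy rule: a low spectral centroid suggests a tone, a high one noise."""
    return "tone" if spectral_centroid(signal, sr) < threshold else "noise"

t = np.arange(SR) / SR
tone = np.sin(2 * np.pi * 440.0 * t)        # 440 Hz sine wave
noise = np.random.default_rng(0).standard_normal(SR)  # white noise

print(classify(tone))
print(classify(noise))
```

A 440 Hz sine concentrates its energy near 440 Hz, while white noise spreads energy across the whole spectrum, so a single frequency statistic separates these two classes; distinguishing speech, music, and environmental sounds requires learned representations like those in the papers below.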
Most implemented papers
Augmenting Deep Classifiers with Polynomial Neural Networks
The efficacy of the proposed models is evaluated on standard image and audio classification benchmarks.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
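A multimodal contrastive loss of the kind VATT trains with can be sketched as a symmetric InfoNCE objective over paired embeddings. This is a generic numpy illustration, not VATT's exact formulation; the temperature value and function names are assumptions.

```python
import numpy as np

def cross_entropy(logits):
    """Softmax cross-entropy where row i's target is class i (the matching pair)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

def multimodal_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE: matching (video, audio) rows are positives,
    every other pairing in the batch serves as a negative."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (B, B) cosine-similarity matrix
    return float((cross_entropy(logits) + cross_entropy(logits.T)) / 2)
```

Aligned pairs drive the diagonal of the similarity matrix up relative to the off-diagonal entries, so the loss is small when the two modalities' embeddings agree and large when they are mismatched.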
Efficient Training of Audio Transformers with Patchout
However, one of the main shortcomings of transformer models compared to well-established CNNs is their computational complexity.
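The core idea of Patchout is to randomly drop input patches during training, which shortens the transformer's input sequence and so reduces the quadratic attention cost. A minimal sketch of that sampling step (function name and drop fraction are illustrative):

```python
import numpy as np

def patchout(patches, drop_frac=0.5, rng=None):
    """Randomly drop a fraction of spectrogram patch embeddings.

    patches: array of shape (num_patches, embed_dim).
    Returns the kept subset in original order; the shorter sequence is
    what the transformer encoder sees during training.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(patches)
    keep = rng.choice(n, size=int(n * (1 - drop_frac)), replace=False)
    keep.sort()  # preserve the patches' temporal/spectral order
    return patches[keep]
```

Since self-attention scales quadratically with sequence length, halving the number of patches roughly quarters the attention FLOPs for that forward pass.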
SSAST: Self-Supervised Audio Spectrogram Transformer
However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST.
CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification
Audio classification is an active research area with a wide range of applications.
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification.
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation
We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of 0.483 mAP on AudioSet.
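Knowledge distillation of this kind trains the student to match the teacher's temperature-softened output distribution. Below is the standard KD objective as a generic numpy sketch (the AudioSet setup is multi-label and uses its own loss details, so treat hyperparameters and names here as assumptions):

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the
    softened student distribution, averaged over the batch."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean())
```

A higher temperature flattens both distributions, exposing the teacher's relative confidences across wrong classes ("dark knowledge") rather than only its top prediction.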
Audiovisual Masked Autoencoders
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning?
BEATs: Audio Pre-Training with Acoustic Tokenizers
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
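A random-projection acoustic tokenizer of this flavor can be sketched as a frozen random matrix that projects each feature frame, followed by a nearest-neighbor lookup in a frozen random codebook; the resulting indices serve as prediction labels for masked pretraining. This is an illustrative sketch, not BEATs' exact tokenizer; all names and dimensions are assumptions.

```python
import numpy as np

def make_tokenizer(feat_dim, code_dim, vocab_size, seed=0):
    """Build a frozen random-projection tokenizer.

    Each input frame is projected by a fixed random matrix, then assigned
    the index of the most similar entry in a fixed random codebook.
    """
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((feat_dim, code_dim))       # frozen projection
    codebook = rng.standard_normal((vocab_size, code_dim)) # frozen codes
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

    def tokenize(frames):
        z = frames @ proj
        z /= np.linalg.norm(z, axis=1, keepdims=True) + 1e-12
        return np.argmax(z @ codebook.T, axis=1)  # nearest code by cosine

    return tokenize
```

Because the projection and codebook are fixed, the tokenizer needs no training itself; it simply provides stable discrete targets that the SSL model learns to predict from masked input.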
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way for building a general representation model toward unlimited modalities.