48 papers with code • 7 benchmarks • 14 datasets
Audio classification, also known as audio tagging, is the task of predicting the tags of an audio clip.
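Since a clip can carry several tags at once, tagging is usually framed as multi-label classification: one sigmoid output per tag, trained with binary cross-entropy. A minimal sketch of that framing (the tiny model, the 10-tag output, and the spectrogram sizes are illustrative, not from any particular paper):

```python
import torch
import torch.nn as nn

# Toy multi-label tagger: log-mel spectrogram in, one logit per tag out.
# All sizes (64 mel bins, 10 tags, the tiny conv stack) are illustrative.
class ToyTagger(nn.Module):
    def __init__(self, num_tags=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # pool over time and frequency
        )
        self.head = nn.Linear(32, num_tags)

    def forward(self, mel):                     # mel: (batch, 1, n_mels, frames)
        return self.head(self.features(mel).flatten(1))

model = ToyTagger()
mel = torch.randn(4, 1, 64, 100)                # fake batch of log-mel clips
targets = torch.randint(0, 2, (4, 10)).float()  # multi-hot tag labels
loss = nn.BCEWithLogitsLoss()(model(mel), targets)  # one decision per tag
loss.backward()
```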
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
Ranked #1 on Action Classification on Moments in Time (using extra training data)
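A hedged sketch of the symmetric InfoNCE-style contrastive loss such multimodal training typically builds on: embeddings of the same clip from two modalities are pulled together, all other pairs in the batch pushed apart. The temperature and dimensions are illustrative, and VATT's actual objectives (e.g. its NCE/MIL-NCE pairings across three modalities) differ in detail:

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of modality embeddings.

    z_a, z_b: (batch, dim) embeddings where row i of each comes from the
    same clip; all other rows act as negatives. Temperature is illustrative.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature        # cosine-similarity matrix
    labels = torch.arange(z_a.size(0))          # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

video_emb = torch.randn(8, 512)   # e.g. video-transformer outputs
audio_emb = torch.randn(8, 512)   # e.g. audio-transformer outputs
loss = info_nce(video_emb, audio_emb)
```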
Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers.
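The claim rests on a standard property of scale-invariant weights: if f(cw) = f(w) for all c > 0, then ∇f(cw) = (1/c)∇f(w) and ⟨∇f(w), w⟩ = 0, so rescaling w changes only the effective step size, never the update direction. A quick numerical check, using a normalized linear model as an illustrative scale-invariant function (not the paper's setup):

```python
import torch

def scale_invariant_loss(w, x, y):
    # Normalizing w makes the loss invariant to the scale of w.
    pred = x @ (w / w.norm())
    return ((pred - y) ** 2).mean()

x, y = torch.randn(16, 8), torch.randn(16)
w = torch.randn(8, requires_grad=True)
g = torch.autograd.grad(scale_invariant_loss(w, x, y), w)[0]

w2 = (2 * w).detach().requires_grad_(True)   # same direction, twice the norm
g2 = torch.autograd.grad(scale_invariant_loss(w2, x, y), w2)[0]

print(torch.allclose(g2, g / 2, atol=1e-6))          # True: |grad| scales as 1/c
print(torch.dot(g, w.detach()).abs() < 1e-5)         # True: grad orthogonal to w
```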
In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding.
Ranked #1 on Self-Supervised Action Recognition on HMDB51
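One way to read "fine-grained vision and audio plus text in a common embedding" is a hierarchy of spaces: video and audio are contrasted in a fine-grained space, which is then projected into a coarser space that text can join. A sketch of such projection heads, loosely inspired by MMV's fine-and-coarse design (all module names and dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class FineCoarseHeads(nn.Module):
    """Video and audio meet in a fine space; a further projection maps that
    fine space into a coarse space shared with text. Dims are illustrative."""
    def __init__(self, d_video=1024, d_audio=512, d_text=300,
                 d_fine=512, d_coarse=256):
        super().__init__()
        self.video_to_fine = nn.Linear(d_video, d_fine)
        self.audio_to_fine = nn.Linear(d_audio, d_fine)
        self.fine_to_coarse = nn.Linear(d_fine, d_coarse)  # shared by v and a
        self.text_to_coarse = nn.Linear(d_text, d_coarse)

    def forward(self, v, a, t):
        v_fine, a_fine = self.video_to_fine(v), self.audio_to_fine(a)
        # Fine space: video-audio contrast keeps fine-grained detail.
        # Coarse space: all three modalities become comparable with text.
        return {
            "va_fine": (v_fine, a_fine),
            "vat_coarse": (self.fine_to_coarse(v_fine),
                           self.fine_to_coarse(a_fine),
                           self.text_to_coarse(t)),
        }

heads = FineCoarseHeads()
out = heads(torch.randn(4, 1024), torch.randn(4, 512), torch.randn(4, 300))
```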
We transfer PANNs to six audio pattern recognition tasks, and demonstrate state-of-the-art performance in several of those tasks.
Ranked #5 on Audio Tagging on AudioSet
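The usual transfer recipe behind results like this is to keep the pretrained backbone and retrain only a small head on the downstream labels. A generic sketch with a stand-in backbone rather than a real PANN (official PANNs checkpoints live at https://github.com/qiuqiangkong/audioset_tagging_cnn; the 2048-d embedding and 50-class head below are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained audio embedding network (e.g. a PANN); the
# architecture and 2048-d output are placeholders, not the real model.
backbone = nn.Sequential(nn.Conv1d(1, 64, 9, stride=4), nn.ReLU(),
                         nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                         nn.Linear(64, 2048))

for p in backbone.parameters():
    p.requires_grad = False                    # freeze pretrained weights

head = nn.Linear(2048, 50)                     # e.g. the 50 ESC-50 classes
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

waveform = torch.randn(8, 1, 16000)            # one second of audio at 16 kHz
logits = head(backbone(waveform))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 50, (8,)))
loss.backward()
optimizer.step()                               # only the head is updated
```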
The perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models.
Ranked #7 on Audio Classification on AudioSet
In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels.
Ranked #1 on Keyword Spotting on Speech Commands (using extra training data)
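A minimal sketch of the convolution-free alternative this line of work introduces: the spectrogram is cut into patches, linearly embedded, and processed by a standard transformer encoder. The non-overlapping patches, tiny dimensions, and mean pooling here are illustrative simplifications; AST itself uses overlapping 16x16 patches, a class token, and ImageNet pretraining:

```python
import torch
import torch.nn as nn

class TinySpectrogramTransformer(nn.Module):
    """Patch-based attention over a spectrogram, in the spirit of AST.
    All sizes are illustrative; positional embeddings omitted for brevity."""
    def __init__(self, num_classes=527, dim=192):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, mel):                    # (batch, 1, n_mels, frames)
        x = self.patch_embed(mel)              # (batch, dim, H', W')
        x = x.flatten(2).transpose(1, 2)       # sequence of patch tokens
        x = self.encoder(x).mean(dim=1)        # mean-pool over tokens
        return self.head(x)                    # AudioSet-style 527 logits

model = TinySpectrogramTransformer()
logits = model(torch.randn(2, 1, 128, 1024))   # ~10 s of 128-bin log-mel
```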
Interpretability of deep neural networks is a recently emerging area of machine learning research targeting a better understanding of how models perform feature selection and derive their classification decisions.
The objective of audio classification is to predict the presence or absence of audio events in an audio clip.
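Since each event class is a separate presence/absence decision, systems emit one probability per class and are typically scored with mean average precision (mAP) over classes. A small evaluation sketch (the class count, scores, and 0.5 threshold are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Per-class presence probabilities for 4 clips and 3 event classes.
scores = np.array([[0.9, 0.2, 0.7],
                   [0.1, 0.8, 0.4],
                   [0.6, 0.3, 0.9],
                   [0.2, 0.7, 0.1]])
# Ground-truth multi-hot labels: 1 if the event occurs in the clip.
labels = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])

# Mean average precision over classes, the usual tagging metric.
mAP = average_precision_score(labels, scores, average="macro")
# A fixed threshold turns probabilities into presence/absence decisions.
present = scores >= 0.5
print(f"mAP = {mAP:.3f}")
```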