Interpretability of deep neural networks is an emerging area of machine learning research that aims at a better understanding of how models perform feature selection and derive their classification decisions.
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio.
Traditional convolutional layers extract features from patches of data by applying a non-linearity to an affine function of the input.
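As a minimal sketch of this operation, the example below computes a single convolutional feature map by sliding a kernel over the input and evaluating activation(w · patch + b) at every position; the single-channel input, the averaging kernel, and the ReLU non-linearity are illustrative assumptions, not choices prescribed above.

    import numpy as np

    def conv2d_feature_map(x, w, b):
        """One feature map: a non-linearity applied to an affine
        function (w . patch + b) of each input patch."""
        kh, kw = w.shape
        out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                affine = np.sum(w * x[i:i + kh, j:j + kw]) + b  # affine part
                out[i, j] = max(0.0, affine)  # ReLU non-linearity (assumed)
        return out

    # Example: a 5x5 input and a 3x3 averaging kernel yield a 3x3 map.
    x = np.arange(25, dtype=float).reshape(5, 5)
    print(conv2d_feature_map(x, np.ones((3, 3)) / 9.0, b=-10.0))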
Building on this, we introduce a method for descriptor-based synthesis and show that we can control an instrument's descriptors while preserving its timbre structure.
Feature learning and deep learning have drawn great attention in recent years as ways of using learning algorithms to transform input data into more effective representations.
In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and a novel audio data modality, namely acoustic images.
We present the ConditionaL Neural Network (CLNN) and the Masked ConditionaL Neural Network (MCLNN), both designed for temporal signal recognition.
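One plausible reading of the conditioning idea is sketched below: the hidden activation at frame t combines a window of n neighbouring frames through per-offset weight matrices, and the masked variant multiplies each weight matrix by a binary mask that removes part of the connectivity. The window size, the ReLU activation, and the random mask pattern are all illustrative assumptions rather than details stated above.

    import numpy as np

    def mclnn_step(frames, weights, mask, bias, t, n):
        """Hidden activation at frame t, conditioned on the n frames on
        either side; each temporal offset u has its own weight matrix,
        and a shared binary mask zeroes out part of the connectivity."""
        pre = bias.copy()
        for u in range(-n, n + 1):
            pre += (weights[u + n] * mask) @ frames[t + u]  # masked conditional term
        return np.maximum(0.0, pre)  # ReLU non-linearity (assumed)

    rng = np.random.default_rng(0)
    feat_dim, hid_dim, n = 8, 4, 1
    frames = rng.normal(size=(10, feat_dim))                   # 10 time frames
    weights = rng.normal(size=(2 * n + 1, hid_dim, feat_dim))  # one matrix per offset
    mask = (rng.random((hid_dim, feat_dim)) < 0.5).astype(float)  # illustrative mask only
    print(mclnn_step(frames, weights, mask, np.zeros(hid_dim), t=5, n=n))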