The performance of machine learning algorithms is known to be negatively affected by possible mismatches between training (source) and test (target) data distributions.
The behavior of the block that implements such operators and, therefore, the entire neural network, can be modified depending on the input to the block, the established residual configurations and the selected non-linear activations.
Although works have been done in using HPSS as input representation for CNN model in ASC task, this paper further investigate the possibility on leveraging the separated harmonic component and percussive component by curating 2 CNNs which tries to understand harmonic audio and percussive audio in their natural form, one specialized in extracting deep features in time biased domain and another specialized in extracting deep features in frequency biased domain, respectively.
Convolutional neural networks are widely adopted in Acoustic Scene Classification (ASC) tasks, but they generally carry a heavy computational burden.
Convolutional neural networks (CNN) are one of the best-performing neural network architectures for environmental sound classification (ESC).
In this paper, we address the task of characterizing acoustic scenes in a workplace setting from audio recordings collected with wearable microphones.
We present CP-JKU submission to MediaEval 2019; a Receptive Field-(RF)-regularized and Frequency-Aware CNN approach for tagging music with emotion/mood labels.
Results have shown that cross-task pre-training mechanism can significantly improve the performance of ASC tasks and the performance of our best model improved relatively 9. 5% in the TAU Urban Acoustic Scenes 2019 dataset, and also improved 10% in the TUT Acoustic Scenes 2017 dataset compared with the official baseline.
With this loss function, the samples from the same audio scene are clustered independently of the environment, and thus we can get the classifier with better generalization ability in an unseen environment.
Acoustic Scene Classification (ASC) is a challenging task, as a single scene may involve multiple events that contain complex sound patterns.