Deep learning models such as CNNs and Transformers have achieved impressive performance for end-to-end audio tagging.
We presented the Treff adapter, a training-efficient adapter for CLAP, to boost zero-shot classification performance by making use of a small set of labelled data.
The results indicate that training the model on multimodal data does positively influence performance when tested on unimodal data.
Notably, we achieved Top-1 performance in Task 2-1 and Task 2-2 with the highest scores of 74.5% and 53.9%, respectively.
Multimodal emotion recognition (MER) is a fundamental yet complex research problem due to the uncertainty of human emotional expression and the heterogeneity gap between different modalities.
Accurately detecting emotions in conversation is a necessary yet challenging task due to the complexity of emotions and dynamics in dialogues.
In this work, we show that while encoding the logic of a whole sleep cycle is crucial to improve sleep staging performance, the sequential modelling approach in existing state-of-the-art deep learning models is inefficient for that purpose.
In this paper, we do a comprehensive analysis of improvement in sound source localization by combining the direction of arrivals (DOAs) with their derivatives which quantify the changes in the positions of sources over time.
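The derivative idea above can be illustrated with a minimal numpy sketch (all names, sampling rates, and the finite-difference estimator are hypothetical, not the paper's implementation): a DOA azimuth track is unwrapped and differentiated so each frame carries both a direction and a rate of change.

```python
import numpy as np

def doa_with_derivatives(azimuths, dt=0.1):
    """Augment a DOA azimuth track (degrees) with a first-order time
    derivative estimated by finite differences.

    Returns an array of shape (T, 2): [azimuth, azimuth_rate].
    """
    az = np.asarray(azimuths, dtype=float)
    # Unwrap so a jump from 359 deg to 1 deg counts as +2 deg, not -358 deg.
    unwrapped = np.degrees(np.unwrap(np.radians(az)))
    rate = np.gradient(unwrapped, dt)  # degrees per second
    return np.stack([az, rate], axis=-1)

# Toy track: a source sweeping at a constant 20 deg/s, sampled every 0.1 s.
track = [10.0, 12.0, 14.0, 16.0]
feats = doa_with_derivatives(track, dt=0.1)
```

A static source yields a near-zero derivative channel, so the extra feature mainly informs the model about moving sources.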
Considering the co-importance of model compactness and robustness in practical applications, several prior works have explored improving the adversarial robustness of sparse neural networks.
Deep learning approaches for black-box modelling of audio effects have shown promise; however, the majority of existing work focuses on nonlinear effects with behaviour on relatively short time-scales, such as guitar amplifiers and distortion.
The results show that the proposed model is promising to achieve personalized longitudinal MS assessment; they also suggest that features related to gait and balance as well as upper extremity function, remotely collected from sensor-based assessments, may be useful digital markers for predicting the progression of MS over time.
Recently, backdoor attacks have become an emerging threat to the security of deep neural network (DNN) models.
The network is composed of a backbone subnet and multiple task-specific subnets.
This method consists of training a model with larger amounts of data from the source modality and few paired samples of source and target modality.
In this work, we introduce SALSA-Lite, a fast and effective feature for polyphonic SELD using microphone array inputs.
Filter pruning has been widely used for neural network compression because of the practical acceleration it enables.
Background: Despite the tremendous progress recently made towards automatic sleep staging in adults, it is currently unknown if the most advanced algorithms generalize to the pediatric population, which displays distinctive characteristics in overnight polysomnography (PSG).
It is based on the transformer backbone and offers interpretability of the model's decisions at both the epoch and sequence level.
Modern sleep monitoring development is shifting towards the use of unobtrusive sensors combined with algorithms for automatic sleep scoring.
The embeddings learned in the subnetworks are then concatenated to form the multi-view embedding for classification, similar to a simple concatenation network.
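A minimal numpy sketch of this multi-view concatenation (the subnetworks, view names, and dimensions are hypothetical stand-ins, not the paper's architecture): each view passes through its own small subnetwork, and the per-view embeddings are concatenated into one vector for the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def view_subnetwork(x, w):
    """A toy per-view subnetwork: one linear layer with ReLU."""
    return np.maximum(x @ w, 0.0)

# Two hypothetical views of the same recording (e.g. EEG and EOG features).
x_eeg, x_eog = rng.normal(size=(1, 8)), rng.normal(size=(1, 6))
w_eeg, w_eog = rng.normal(size=(8, 4)), rng.normal(size=(6, 4))

# Learn an embedding per view, then concatenate into the multi-view embedding.
emb = np.concatenate(
    [view_subnetwork(x_eeg, w_eeg), view_subnetwork(x_eog, w_eog)], axis=-1
)
```

The classifier then sees a single 8-dimensional vector, exactly as a simple concatenation network would.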
1 code implementation • 7 Feb 2021 • Phairot Autthasan, Rattanaphon Chaisaen, Thapanun Sudhawiyangkul, Phurin Rangpong, Suktipol Kiatthaveephong, Nat Dilokthanakul, Gun Bhakdisongkhram, Huy Phan, Cuntai Guan, Theerawit Wilaiprasitporn
We integrate deep metric learning into a multi-task autoencoder to learn a compact and discriminative latent representation from EEG and perform classification simultaneously.
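The joint objective described above can be sketched as a weighted sum of three terms: an autoencoder reconstruction loss, a contrastive metric-learning loss on the latent embeddings, and a classification loss. This numpy sketch uses hypothetical names, weights, and a pairwise contrastive formulation; the actual model may combine the terms differently.

```python
import numpy as np

def multitask_loss(x, x_hat, z, labels, logits, margin=1.0, weights=(1.0, 1.0, 1.0)):
    """Joint objective sketch: reconstruction + metric learning + classification.

    x, x_hat : inputs and autoencoder reconstructions, shape (N, D)
    z        : latent embeddings, shape (N, K)
    labels   : integer class labels, shape (N,)
    logits   : classifier outputs, shape (N, C)
    """
    # 1) Autoencoder reconstruction term.
    recon = np.mean((x - x_hat) ** 2)

    # 2) Contrastive metric term: pull same-class pairs together,
    #    push different-class pairs at least `margin` apart.
    metric, pairs = 0.0, 0
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            d = np.linalg.norm(z[i] - z[j])
            metric += d ** 2 if labels[i] == labels[j] else max(0.0, margin - d) ** 2
            pairs += 1
    metric /= max(pairs, 1)

    # 3) Softmax cross-entropy classification term.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    ce = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

    a, b, c = weights
    return a * recon + b * metric + c * ce

# Toy demo: perfect reconstruction, well-separated latents, confident classifier.
x = np.zeros((2, 3))
z = np.array([[0.0, 0.0], [2.0, 0.0]])
labels = np.array([0, 1])
logits = np.array([[10.0, 0.0], [0.0, 10.0]])
loss = multitask_loss(x, x, z, labels, logits)
```

Because all three terms are minimised jointly, the latent space is shaped to be both reconstructive and class-discriminative at once.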
This paper presents an inception-based deep neural network for detecting lung diseases using respiratory sound input.
Existing generative adversarial networks (GANs) for speech enhancement solely rely on the convolution operation, which may obscure temporal dependencies across the sequence input.
Sound event localization and detection (SELD) has been commonly tackled using multitask models.
This work proposes a sequence-to-sequence sleep staging model, XSleepNet, that is capable of learning a joint representation from both raw signals and time-frequency images.
Ranked #1 on Sleep Stage Detection on SHHS
We employ the pretrained SeqSleepNet (i.e., the subject-independent model) as a starting point and finetune it with the single-night personalization data to derive the personalized model.
1 code implementation • 8 Apr 2020 • Nannapas Banluesombatkul, Pichayoot Ouppaphan, Pitshaporn Leelaarporn, Payongkit Lakhan, Busarakum Chaitusaney, Nattapong Jaimchariyatam, Ekapol Chuangsuwanich, Wei Chen, Huy Phan, Nat Dilokthanakul, Theerawit Wilaiprasitporn
This is the first work to investigate a non-conventional pre-training method, MAML, opening up the possibility of human-machine collaboration in sleep stage classification and easing the clinicians' burden by requiring them to label only several epochs rather than an entire recording.
This paper presents and explores a robust deep learning framework for auscultation analysis.
This paper presents a robust deep learning framework developed to detect respiratory diseases from recordings of respiratory sounds.
The former constrains the generators to learn a common mapping that is iteratively applied at all enhancement stages and results in a small model footprint.
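The weight-sharing idea above can be sketched in a few lines of numpy (the generator, its parameterisation, and the stage count are hypothetical placeholders, not the actual GAN): one mapping is reused at every enhancement stage, so adding stages adds no parameters.

```python
import numpy as np

def shared_generator(x, w):
    """One enhancement stage: a single mapping whose parameters `w`
    are reused at every stage (tanh keeps the output bounded)."""
    return np.tanh(w * x)

def multi_stage_enhance(x, w, n_stages=3):
    """Iteratively apply the SAME generator, so the model footprint
    stays constant regardless of the number of stages."""
    for _ in range(n_stages):
        x = shared_generator(x, w)
    return x

# Toy signal passed through three tied enhancement stages.
x = np.linspace(-2.0, 2.0, 5)
y = multi_stage_enhance(x, 1.5, n_stages=3)
```

Contrast this with stage-specific generators, where the parameter count would grow linearly with the number of stages.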
In addition, CAG exhibits high transferability across different DNN classifier models in black-box attack scenarios by introducing random dropout in the process of generating perturbations.
We employ the Montreal Archive of Sleep Studies (MASS) database consisting of 200 subjects as the source domain and study deep transfer learning on three different target domains: the Sleep Cassette subset and the Sleep Telemetry subset of the Sleep-EDF Expanded database, and the Surrey-cEEGrid database.
Ranked #1 on Multimodal Sleep Stage Detection on Surrey-PSG
This work presents a deep transfer learning approach to overcome the channel mismatch problem and transfer knowledge from a large dataset to a small cohort to study automatic sleep staging with single-channel input.
Acoustic scenes are rich and redundant in their content.
We propose a multi-label multi-task framework based on a convolutional recurrent neural network to unify detection of isolated and overlapping audio events.
Moreover, as model fusion with deep network ensemble is prevalent in audio scene classification, we further study whether, and if so, when model fusion is necessary for this task.
At the sequence processing level, a recurrent layer is placed on top of the learned epoch-wise features for long-term modelling of sequential epochs.
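A minimal numpy sketch of a recurrent layer running over epoch-wise features (the tanh RNN cell, dimensions, and initialisation are illustrative assumptions; the actual model may use a gated recurrent unit): each 30-second epoch's feature vector updates a hidden state that summarises the preceding epochs.

```python
import numpy as np

def rnn_over_epochs(epoch_feats, w_x, w_h, b):
    """A minimal tanh RNN over a sequence of epoch-wise feature
    vectors (one vector per 30-second sleep epoch)."""
    h = np.zeros(w_h.shape[0])
    hs = []
    for x in epoch_feats:
        h = np.tanh(w_x @ x + w_h @ h + b)
        hs.append(h)
    return np.stack(hs)  # one context-aware vector per epoch

rng = np.random.default_rng(0)
T, D, H = 5, 8, 4                       # 5 epochs, 8-dim features, 4 hidden units
feats = rng.normal(size=(T, D))
w_x = rng.normal(size=(H, D)) * 0.1
w_h = rng.normal(size=(H, H)) * 0.1
ctx = rnn_over_epochs(feats, w_x, w_h, np.zeros(H))
```

The output at each step depends on all earlier epochs, which is what makes long-term modelling of the sleep sequence possible.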
While the proposed framework is orthogonal to the widely adopted classification schemes, which take one or multiple epochs as contextual inputs and produce a single classification decision on the target epoch, we demonstrate its advantages in several ways.
Ranked #2 on Sleep Stage Detection on MASS SS2
The proposed system consists of a novel inference step coupled with dual parallel tailored-loss deep neural networks (DNNs).
Our proposed systems significantly outperform the challenge baseline, improving F-score from 72.7% to 90.0% and reducing detection error rate from 0.53 to 0.18 on average on the development data.
We introduce in this work an efficient approach for audio scene classification using deep recurrent neural networks.
We trained a deep all-convolutional neural network with masked global pooling to perform single-label classification for acoustic scene classification and multi-label classification for domestic audio tagging in the DCASE-2016 contest.
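Masked global pooling can be sketched as follows (a numpy toy, with hypothetical shapes and mask semantics, not the trained network): the global average over each convolutional feature map is computed only over valid time-frequency positions, so padded regions do not dilute the statistics.

```python
import numpy as np

def masked_global_avg_pool(feature_map, mask):
    """Global average pooling that ignores masked-out (e.g. padded)
    time-frequency positions.

    feature_map : (C, T, F) conv feature maps
    mask        : (T, F) boolean, True where the input is valid
    """
    valid = mask[None].astype(float)               # broadcast over channels
    return (feature_map * valid).sum(axis=(1, 2)) / valid.sum(axis=(1, 2))

# Toy demo: two channels, with the last time-frequency column masked out.
fm = np.arange(24, dtype=float).reshape(2, 3, 4)
mask = np.ones((3, 4), dtype=bool)
mask[:, 3] = False
pooled = masked_global_avg_pool(fm, mask)
```

The pooled vector has one scalar per channel and feeds directly into the single-label or multi-label classification head.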
The regression phase is then carried out to let the positive audio segments vote for the event onsets and offsets, and therefore model the temporal structure of audio events.
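The voting step above can be sketched in a few lines (segment centers, regressed distances, and median aggregation are illustrative assumptions, not the paper's exact scheme): each positively classified segment converts its regressed distances into onset/offset candidates, and the votes are aggregated robustly.

```python
import numpy as np

def vote_event_boundaries(seg_centers, onset_dists, offset_dists):
    """Each positive segment casts a vote for the event onset
    (center - predicted distance to onset) and offset
    (center + predicted distance to offset); the median vote wins."""
    onsets = np.asarray(seg_centers) - np.asarray(onset_dists)
    offsets = np.asarray(seg_centers) + np.asarray(offset_dists)
    return float(np.median(onsets)), float(np.median(offsets))

# Three positive segments (centers in seconds) voting on one event.
onset, offset = vote_event_boundaries(
    seg_centers=[1.0, 1.5, 2.0],
    onset_dists=[0.5, 1.0, 1.5],
    offset_dists=[2.0, 1.5, 1.0],
)
```

Using the median makes the estimated boundaries robust to an occasional mis-regressed segment.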
This category taxonomy is then used in the feature extraction step in which an audio scene instance is represented by a label tree embedding image.
We present in this paper an efficient approach for acoustic scene classification by exploring the structure of class labels.
The entries of the descriptor are produced by evaluating a set of regressors on the input signal.
We present in this paper a simple, yet efficient convolutional neural network (CNN) architecture for robust audio event recognition.