Our approach employs a teacher-student framework to transfer knowledge from a larger, more complex model to a smaller, lightweight one, using dual-view cross-correlation distillation and the teacher's codebook as learning objectives.
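As a rough sketch of what a cross-correlation distillation objective can look like (the normalization, layer choice, and loss weighting below are illustrative assumptions rather than the exact recipe), one view computes a normalized cross-correlation matrix between matched teacher and student frame embeddings and drives it toward the identity, aligning corresponding dimensions while decorrelating the rest:

import torch

def cross_correlation_distill_loss(student, teacher, off_diag_weight=0.005):
    # student, teacher: (frames, dim) embeddings from matched layers with equal dim
    # (in practice a linear projection on the student side would ensure this).
    s = (student - student.mean(0)) / (student.std(0) + 1e-6)
    t = (teacher - teacher.mean(0)) / (teacher.std(0) + 1e-6)
    n = s.shape[0]
    c = (s.T @ t) / n                                             # (dim, dim) cross-correlation
    on_diag = ((torch.diagonal(c) - 1.0) ** 2).sum()              # align matched dimensions
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()   # suppress redundant ones
    return on_diag + off_diag_weight * off_diag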
Self-supervised speech representation learning (S3RL) is revolutionizing the way we leverage the ever-growing availability of data.
We investigate the use of the mapping-based method in the time domain and show that it can outperform the masking-based method when a large training set is available.
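To make the distinction concrete, here is a minimal sketch (the tiny convolutional network, shapes, and the mode switch are assumptions for illustration, not the model studied): a masking-based approach predicts a bounded mask that multiplies the noisy waveform, whereas a mapping-based approach regresses the clean waveform directly.

import torch
import torch.nn as nn

class TimeDomainEnhancer(nn.Module):
    def __init__(self, mode="mapping", channels=64):
        super().__init__()
        self.mode = mode
        self.net = nn.Sequential(                   # stand-in for a real encoder/decoder
            nn.Conv1d(1, channels, 16, padding=8),
            nn.ReLU(),
            nn.Conv1d(channels, 1, 16, padding=7),
        )

    def forward(self, noisy):                       # noisy: (batch, 1, samples)
        out = self.net(noisy)
        if self.mode == "masking":
            return torch.sigmoid(out) * noisy       # bounded mask applied to the input
        return out                                  # mapping: estimate the clean waveform directly

Intuitively, the masking path constrains the output to rescalings of the input samples, while the mapping path can synthesize arbitrary sample values and so has more to gain from abundant training data.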
In addition, leveraging our density map generation method, we propose an iterative distillation algorithm that progressively enhances our model across identical network structures, without significantly reducing the dimensions of the output density maps.
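A minimal sketch of such an iterative scheme (the callables and round count below are placeholders; this only illustrates the idea of repeatedly distilling into a network of identical structure, with each converged student promoted to teacher for the next round):

import copy

def iterative_distillation(build_model, train_one_round, rounds=3):
    """Each round trains a fresh student of identical architecture against the
    previous round's model; the trained student becomes the next teacher."""
    teacher = build_model()
    train_one_round(student=teacher, teacher=None)   # round 0: ordinary supervised training
    for _ in range(rounds):
        student = build_model()                      # identical network structure
        train_one_round(student=student, teacher=copy.deepcopy(teacher))
        teacher = student                            # promote the student to teacher
    return teacher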
To improve robustness, a speech enhancement front-end is incorporated.
Most current speech enhancement models use spectrogram features that require an expensive transformation and result in phase information loss.
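As an illustration of this cost and the phase loss (a sketch with assumed STFT parameters, not tied to any specific model), magnitude-spectrogram pipelines compute an STFT, keep only the magnitude, and at synthesis time must borrow a phase, typically the noisy one:

import torch

def magnitude_spectrogram(wave, n_fft=512, hop=128):
    # Complex STFT, then magnitude only: the phase component is discarded here.
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs()

def reconstruct_with_noisy_phase(enhanced_mag, noisy_wave, n_fft=512, hop=128):
    # Without the clean phase, the enhanced magnitude is paired with the noisy phase.
    noisy_spec = torch.stft(noisy_wave, n_fft=n_fft, hop_length=hop,
                            window=torch.hann_window(n_fft), return_complex=True)
    phase = noisy_spec / (noisy_spec.abs() + 1e-8)
    return torch.istft(enhanced_mag * phase, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))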
The proposed hybrid attention architecture helps the system learn informative representations for both modality-specific feature extraction and multimodal fusion.
Multimodal affective computing, which learns to recognize and interpret human affect and subjective information from multiple data sources, remains challenging because: (i) it is hard to extract informative features that represent human affect from heterogeneous inputs; and (ii) current fusion strategies only combine modalities at an abstract level, ignoring time-dependent interactions between modalities.
In this paper, we present a novel deep multimodal framework to predict human emotions based on sentence-level spoken language.
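A minimal sketch of a hybrid attention design in this spirit (the dimensions, number of heads, single cross-attention direction, and four-class head are all assumptions for illustration): per-modality self-attention handles modality-specific feature extraction, and cross-modal attention handles fusion.

import torch
import torch.nn as nn

class HybridAttentionFusion(nn.Module):
    """Sketch: per-modality self-attention for feature extraction, then
    cross-modal attention for fusion (all hyperparameters are assumptions)."""
    def __init__(self, dim=128, heads=4, num_classes=4):
        super().__init__()
        self.text_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text, audio):                # each: (batch, seq, dim)
        t, _ = self.text_self(text, text, text)    # modality-specific attention
        a, _ = self.audio_self(audio, audio, audio)
        fused, _ = self.cross(t, a, a)             # text queries attend over audio for fusion
        return self.classifier(fused.mean(dim=1))  # sentence-level emotion prediction

A symmetric audio-to-text cross-attention branch, or fusion applied at each time step, would be a natural way to capture the time-dependent interactions noted above.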
For the Olympic swimming dataset, our system achieved an accuracy of 88%, an F1-score of 0.58, a completeness estimation error of 6.3%, and a remaining-time estimation error of 2.9 minutes.