We investigate features extracted from these signals against various user engagement indicators, including views, the like/dislike ratio, and the sentiment of comments.
no code implementations • 19 Apr 2021 • Shuo Liu, Jing Han, Estela Laporta Puyal, Spyridon Kontaxis, Shaoxiong Sun, Patrick Locatelli, Judith Dineley, Florian B. Pokorny, Gloria Dalla Costa, Letizia Leocani, Ana Isabel Guerrero, Carlos Nos, Ana Zabalza, Per Soelberg Sørensen, Mathias Buron, Melinda Magyari, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Callum Stewart, Amos A Folarin, Richard JB Dobson, Raquel Bailón, Srinivasan Vairavan, Nicholas Cummins, Vaibhav A Narayan, Matthew Hotopf, Giancarlo Comi, Björn Schuller
This study investigates the potential of deep learning methods to identify individuals with suspected COVID-19 infection using remotely collected heart-rate data.
The corpus is then utilised to create a novel framework for multi-corpus speech emotion recognition, namely EmoNet.
This paper addresses these shortcomings by proposing a novel model that efficiently extracts both spatial and temporal features of the data by means of its enhanced temporal modelling based on latent features.
Truly real-life data presents a strong but exciting challenge for sentiment and emotion research.
A potential approach to tackling this is Federated Learning (FL), which enables multiple parties to collaboratively learn a shared prediction model by using parameters of locally trained models while keeping raw training data locally.
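The federated setup described above can be sketched in a few lines; this is a minimal FedAvg-style toy (the linear model, function names, and constants are illustrative assumptions, not from the paper), in which each client trains on its own data and only model parameters travel to the server:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: gradient descent on a linear
    model with squared loss, standing in for any local learner."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(client_weights, client_sizes):
    """Server step: dataset-size-weighted average of client parameters."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Toy scenario: two clients jointly fit y = 2x without sharing raw data.
rng = np.random.default_rng(0)
data = []
for n in (20, 30):
    X = rng.normal(size=(n, 1))
    data.append((X, 2 * X[:, 0]))

w_global = np.zeros(1)
for _ in range(50):  # communication rounds
    updates = [local_update(w_global, X, y) for X, y in data]
    w_global = federated_average(updates, [len(y) for _, y in data])
print(round(float(w_global[0]), 2))  # ≈ 2.0
```

Note that only the parameter vectors returned by `local_update` ever leave a client; the raw `(X, y)` pairs stay local, which is the privacy argument behind FL.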
Recently, there has been increased attention towards innovating, enhancing, building, and deploying speech signal processing applications to provide assistance and relief to humankind during the Coronavirus (COVID-19) pandemic.
Computers and Society • Sound • Audio and Speech Processing
Detecting hate speech, especially in low-resource languages, is a non-trivial challenge.
However, the model can become redundant if it is intended for a specific task.
However, the use of latent features, which adversarial learning makes feasible, has not yet been widely explored.
N-HANS is a Python toolkit for in-the-wild audio enhancement, including speech, music, and general audio denoising, separation, and selective noise or source suppression.
Sound • Audio and Speech Processing
The Poisson equation is commonly encountered in engineering, for instance in computational fluid dynamics (CFD) where it is needed to compute corrections to the pressure field to ensure the incompressibility of the velocity field.
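As a concrete illustration of the setting above, here is a minimal Jacobi iteration for the 2-D Poisson equation ∇²p = f on the unit square with zero Dirichlet boundaries (the grid size and manufactured solution are illustrative assumptions; a real CFD pressure-correction solver would use a faster method such as multigrid):

```python
import numpy as np

def solve_poisson(f, h, iters=5000):
    """Jacobi iteration for the 5-point discrete Laplacian,
    with p = 0 on the boundary."""
    p = np.zeros_like(f)
    for _ in range(iters):
        p[1:-1, 1:-1] = 0.25 * (p[2:, 1:-1] + p[:-2, 1:-1] +
                                p[1:-1, 2:] + p[1:-1, :-2] -
                                h * h * f[1:-1, 1:-1])
    return p

# Manufactured solution p = sin(pi x) sin(pi y), so f = -2 pi^2 p.
n = 33
x = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(x, x, indexing="ij")
exact = np.sin(np.pi * X) * np.sin(np.pi * Y)
f = -2 * np.pi**2 * exact
p = solve_poisson(f, h=x[1] - x[0])
print(np.max(np.abs(p - exact)) < 1e-2)  # only discretisation error remains
```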
In this article, we study laughter found in child-robot interaction where it had not been prompted intentionally.
Motivated by this, we propose a novel crossmodal emotion embedding framework called EmoBed, which aims to leverage the knowledge from other auxiliary modalities to improve the performance of an emotion recognition system at hand.
no code implementations • 10 Jul 2019 • Fabien Ringeval, Björn Schuller, Michel Valstar, Nicholas Cummins, Roddy Cowie, Leili Tavabi, Maximilian Schmitt, Sina Alisamir, Shahin Amiriparian, Eva-Maria Messner, Siyang Song, Shuo Liu, Ziping Zhao, Adria Mallol-Ragolta, Zhao Ren, Mohammad Soleymani, Maja Pantic
The Audio/Visual Emotion Challenge and Workshop (AVEC 2019) "State-of-Mind, Detecting Depression with AI, and Cross-cultural Affect Recognition" is the ninth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions.
We present a novel source separation model to decompose a single-channel speech signal into two speech segments belonging to two different speakers.
Synthesising 3D facial motion from speech is a crucial problem manifesting in a multitude of applications such as computer games and movies.
Generative Adversarial Networks (GANs) have become exceedingly popular in a wide range of data-driven research fields, due in part to their success in image generation.
Despite its drawbacks, the mean squared error ($MSE$) remains one of the most popular performance metrics (and loss functions), joined more recently by the concordance correlation coefficient ($\rho_c$), in many sequence prediction challenges.
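The two metrics mentioned above are straightforward to compute; a minimal NumPy sketch (variable names are illustrative), where `ccc` implements the concordance correlation coefficient commonly denoted $\rho_c$:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two sequences."""
    return float(np.mean((y_true - y_pred) ** 2))

def ccc(y_true, y_pred):
    """Concordance correlation coefficient: agreement between two
    continuous sequences, penalising scale and location shifts."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return float(2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2))

y = np.array([0.0, 0.5, 1.0, 1.5])
assert ccc(y, y) == 1.0       # perfect agreement
assert ccc(y, y + 1.0) < 1.0  # a constant offset lowers rho_c,
                              # though Pearson correlation stays at 1
```

The last two lines show why $\rho_c$ is often preferred to Pearson correlation for continuous emotion prediction: a systematically biased prediction is still penalised.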
We address the problem of speech enhancement generalisation to unseen environments by performing two manipulations.
Over the past few years, adversarial training has become an extremely active research topic and has been successfully applied to various Artificial Intelligence (AI) domains.
The performance of speaker-related systems usually degrades heavily in practical applications largely due to the presence of background noise.
This paper describes audEERING's submissions as well as additional evaluations for the One-Minute-Gradual (OMG) emotion recognition challenge.
Automatic understanding of human affect using visual signals is of great importance in everyday human-machine interactions.
Despite the aforementioned accuracy advantages, contemporary neural networks are generally poorly calibrated and thus do not produce reliable output probability estimates.
Scientific disciplines such as Behavioural Psychology, Anthropology, and, more recently, Social Signal Processing are concerned with the systematic exploration of human behaviour.
Neural network models that are not conditioned on class identities were shown to facilitate knowledge transfer between classes and to be well-suited for one-shot learning tasks.
auDeep is a Python toolkit for deep unsupervised representation learning from acoustic data.
Sound • Audio and Speech Processing
Extensive evaluation on a large acoustic event database demonstrates that the learnt audio sequence representations outperform other state-of-the-art hand-crafted sequence features for AEC by a large margin.
Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge.
Our experiments show that indeed in almost all cases, losses that are aligned with the Principle of Logit Separation obtain at least 20% relative accuracy improvement in the SLC task compared to losses that are not aligned with it, and sometimes considerably more.
The system is then trained in an end-to-end fashion where - by also taking advantage of the correlations between the streams - we manage to significantly outperform traditional approaches based on auditory and visual handcrafted features for the prediction of spontaneous and natural emotions on the RECOLA database of the AVEC 2016 research challenge on emotion recognition.
The goal of this paper is to model these structures and estimate complex feature representations simultaneously by combining conditional random field (CRF) encoded AU dependencies with deep learning.
We propose incorporating this idea of tunable sensitivity for hard examples in neural network learning, using a new generalization of the cross-entropy gradient step, which can be used in place of the gradient in any gradient-based training method.
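For reference, the standard softmax cross-entropy gradient that such a generalisation starts from is, for logits $z$ and a one-hot target $y$ (the authors' tunable variant itself is not reproduced here):

$$\frac{\partial \mathcal{L}}{\partial z_i} = p_i - y_i, \qquad p_i = \frac{e^{z_i}}{\sum_j e^{z_j}},$$

so a drop-in replacement for this gradient step can reweight how strongly hard examples (those with large $|p_i - y_i|$) drive the update.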
Traditional convolutional layers extract features from patches of data by applying a non-linearity on an affine function of the input.
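Written out explicitly, the "affine function of a patch followed by a non-linearity" structure looks like this; a minimal 1-D NumPy sketch (filter values and the ReLU choice are illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d_layer(x, w, b):
    """A traditional 1-D convolutional layer made explicit: for each
    patch of the input, apply the affine map w . patch + b, then a
    non-linearity."""
    k = len(w)
    return np.array([relu(np.dot(w, x[i:i + k]) + b)
                     for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, -1.0, 0.0, 3.0])
w = np.array([1.0, -1.0])  # a simple difference filter
out = conv1d_layer(x, w, b=0.0)
print(out)  # [0. 3. 0. 0.] - negative responses clipped by the ReLU
```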
We introduce a recurrent neural network architecture for automated road surface wetness detection from audio of tire-surface interaction.
Transcription of broadcast news is an interesting and challenging application for large-vocabulary continuous speech recognition (LVCSR).
The goal of the system is to analyse sounds emitted by walking persons (mostly the step sounds) and identify those persons.
Individuals with Autism Spectrum Conditions (ASC) have marked difficulties using verbal and non-verbal communication for social interaction.
This volume contains the papers accepted at the 6th International Symposium on Attention in Cognitive Systems (ISACS 2013), held in Beijing, August 5, 2013.