This indicates the strong transferability and feasibility of LLMs in the field of emotion recognition.
Despite significant advancements in deep learning for vision and natural language, unsupervised domain adaptation in audio remains relatively unexplored.
Our proposed method significantly outperforms the single-task and multi-task baselines in terms of UAR and F1, with p-values < 0.05.
Advances in passive acoustic monitoring and machine learning have led to the procurement of vast datasets for computational bioacoustic research.
In this work, we comprehensively study the capabilities of the ChatGPT models, namely GPT-4 and GPT-3.5, on 13 affective computing problems: aspect extraction, aspect polarity classification, opinion extraction, sentiment analysis, sentiment intensity ranking, emotion intensity ranking, suicide tendency detection, toxicity detection, well-being assessment, engagement measurement, personality assessment, sarcasm detection, and subjectivity detection.
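As an illustration, a zero-shot prompt of the kind such a study evaluates might look as follows; the model name, prompt wording, and label set are assumptions for this sketch, not the exact protocol used in the paper.

```python
# Minimal sketch of zero-shot sentiment analysis via the OpenAI chat API.
# Model choice and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # or "gpt-3.5-turbo"
        messages=[
            {"role": "system",
             "content": "Answer with exactly one word: positive, negative, or neutral."},
            {"role": "user", "content": f"Sentiment of: {text}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(classify_sentiment("The staff were friendly but the food was bland."))
```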
no code implementations • 22 Aug 2023 • Yuezhou Zhang, Amos A Folarin, Judith Dineley, Pauline Conde, Valeria de Angel, Shaoxiong Sun, Yatharth Ranjan, Zulqarnain Rashid, Callum Stewart, Petroula Laiou, Heet Sankesara, Linglong Qian, Faith Matcham, Katie M White, Carolin Oetzmann, Femke Lamers, Sara Siddi, Sara Simblett, Björn W. Schuller, Srinivasan Vairavan, Til Wykes, Josep Maria Haro, Brenda WJH Penninx, Vaibhav A Narayan, Matthew Hotopf, Richard JB Dobson, Nicholas Cummins, RADAR-CNS consortium
Our study identified 29 topics in 3919 smartphone-collected speech recordings from 265 participants using the Whisper tool and BERTopic model.
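A minimal sketch of such a transcribe-then-topic-model pipeline, using the open-source whisper and bertopic packages; the file names are placeholders, not the study's data.

```python
# Transcribe speech recordings with Whisper, then discover topics with BERTopic.
import whisper
from bertopic import BERTopic

asr = whisper.load_model("base")

# Placeholder list of recordings; each transcription yields one document.
recordings = ["rec_001.wav", "rec_002.wav"]
docs = [asr.transcribe(path)["text"] for path in recordings]

# Fit a topic model over the transcripts and inspect the discovered topics.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())
```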
In this work, we extend this by exploring if ChatGPT has novel knowledge that would enhance existing specialised models when they are fused together.
This paper aims to serve as a comprehensive guide for researchers and practitioners, offering insights into the current state, potential applications, and future research directions for generative artificial intelligence and foundation models within the context of intelligent vehicles.
1 code implementation • 5 May 2023 • Lukas Christ, Shahin Amiriparian, Alice Baird, Alexander Kathan, Niklas Müller, Steffen Klug, Chris Gagne, Panagiotis Tzirakis, Eva-Maria Meßner, Andreas König, Alan Cowen, Erik Cambria, Björn W. Schuller
Participants predict the presence of spontaneous humour in a cross-cultural setting.
no code implementations • 28 Apr 2023 • Björn W. Schuller, Anton Batliner, Shahin Amiriparian, Alexander Barnhill, Maurice Gerczuk, Andreas Triantafyllopoulos, Alice Baird, Panagiotis Tzirakis, Chris Gagne, Alan S. Cowen, Nikola Lackovic, Marie-José Caraty, Claude Montacié
The ACM Multimedia 2023 Computational Paralinguistics Challenge addresses two different problems for the first time in a research competition under well-defined conditions: In the Emotion Share Sub-Challenge, a regression on speech has to be made; and in the Requests Sub-Challenge, requests and complaints need to be detected.
3 code implementations • 18 Apr 2023 • Zheng Lian, Haiyang Sun, Licai Sun, Kang Chen, Mingyu Xu, Kexin Wang, Ke Xu, Yu He, Ying Li, Jinming Zhao, Ye Liu, Bin Liu, Jiangyan Yi, Meng Wang, Erik Cambria, Guoying Zhao, Björn W. Schuller, JianHua Tao
The first Multimodal Emotion Recognition Challenge (MER 2023) was successfully held at ACM Multimedia.
We utilise three baselines: a robust language model (RoBERTa-base), a legacy word model with pretrained embeddings (Word2Vec), and a simple bag-of-words baseline (BoW).
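A minimal sketch of the bag-of-words baseline with scikit-learn; the classifier choice and toy data are assumptions for illustration, and the RoBERTa and Word2Vec baselines would swap in a different vectoriser.

```python
# Bag-of-words text baseline: count vectoriser + linear classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["what a hilarious save", "the coach looks furious"]
labels = [1, 0]  # e.g. humour present / absent (placeholder data)

bow_baseline = make_pipeline(CountVectorizer(), LogisticRegression())
bow_baseline.fit(texts, labels)
print(bow_baseline.predict(["that joke was brilliant"]))
```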
Driven by the need for larger and more diverse datasets to pre-train and fine-tune increasingly complex machine learning models, the number of datasets is rapidly growing.
We then name exemplary use cases of computational charismatic skills before switching to ethical aspects and concluding this overview and perspective on building charisma-enabled AI.
Telling stories is an integral part of human communication which can evoke emotions and influence the affective states of the audience.
1 code implementation • 15 Dec 2022 • Jobie Budd, Kieran Baker, Emma Karoune, Harry Coppock, Selina Patel, Ana Tendero Cañadas, Alexander Titcomb, Richard Payne, David Hurley, Sabrina Egglestone, Lorraine Butler, Jonathon Mellor, George Nicholson, Ivan Kiskin, Vasiliki Koutra, Radka Jersakova, Rachel A. McKendry, Peter Diggle, Sylvia Richardson, Björn W. Schuller, Steven Gilmour, Davide Pigoli, Stephen Roberts, Josef Packham, Tracey Thornley, Chris Holmes
The UK COVID-19 Vocal Audio Dataset represents the largest collection of SARS-CoV-2 PCR-referenced audio recordings to date.
no code implementations • 15 Dec 2022 • Davide Pigoli, Kieran Baker, Jobie Budd, Lorraine Butler, Harry Coppock, Sabrina Egglestone, Steven G. Gilmour, Chris Holmes, David Hurley, Radka Jersakova, Ivan Kiskin, Vasiliki Koutra, Jonathon Mellor, George Nicholson, Joe Packham, Selina Patel, Richard Payne, Stephen J. Roberts, Björn W. Schuller, Ana Tendero-Cañadas, Tracey Thornley, Alexander Titcomb
Since early in the coronavirus disease 2019 (COVID-19) pandemic, there has been interest in using artificial intelligence methods to predict COVID-19 infection status based on vocal audio signals, for example cough recordings.
1 code implementation • 15 Dec 2022 • Harry Coppock, George Nicholson, Ivan Kiskin, Vasiliki Koutra, Kieran Baker, Jobie Budd, Richard Payne, Emma Karoune, David Hurley, Alexander Titcomb, Sabrina Egglestone, Ana Tendero Cañadas, Lorraine Butler, Radka Jersakova, Jonathon Mellor, Selina Patel, Tracey Thornley, Peter Diggle, Sylvia Richardson, Josef Packham, Björn W. Schuller, Davide Pigoli, Steven Gilmour, Stephen Roberts, Chris Holmes
Recent work has reported that AI classifiers trained on audio recordings can accurately predict severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection status.
Our experiments demonstrate that training a lightweight SER model on the target dataset with speech samples and graphs can not only produce small SER models, but also enhance the model performance compared to models with speech samples only and those using classic transfer learning strategies.
no code implementations • 6 Oct 2022 • Andreas Triantafyllopoulos, Björn W. Schuller, Gökçe İymen, Metin Sezgin, Xiangheng He, Zijiang Yang, Panagiotis Tzirakis, Shuo Liu, Silvan Mertes, Elisabeth André, Ruibo Fu, JianHua Tao
Speech is the fundamental mode of human communication, and its synthesis has long been a core priority in human-computer interaction research.
Our findings suggest that for the automatic analysis of humour and its sentiment, facial expressions are most promising, while humour direction can be best modelled via text-based features.
Vocal bursts play an important role in communicating affect, making them valuable for improving speech emotion recognition.
We also study how 3D face shapes performed on AU intensity estimation on the BP4D and DISFA datasets, and report that 3D face features were on par with 2D appearance features for AUs 4, 6, 10, 12, and 25, but not for the entire set of AUs.
1 code implementation • 23 Jun 2022 • Lukas Christ, Shahin Amiriparian, Alice Baird, Panagiotis Tzirakis, Alexander Kathan, Niklas Müller, Lukas Stappen, Eva-Maria Meßner, Andreas König, Alan Cowen, Erik Cambria, Björn W. Schuller
For this year's challenge, we feature three datasets: (i) the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset that contains audio-visual recordings of German football coaches, labelled for the presence of humour; (ii) the Hume-Reaction dataset in which reactions of individuals to emotional stimuli have been annotated with respect to seven emotional expression intensities, and (iii) the Ulm-Trier Social Stress Test (Ulm-TSST) dataset, comprising audio-visual data labelled with continuous emotion values (arousal and valence) of people in stressful dispositions.
As compared to other existing COVID-19 sound datasets, the unique feature of the COVYT dataset is that it comprises both COVID-19 positive and negative samples from all 65 speakers.
In this work, we explore a novel few-shot personalisation architecture for emotional vocalisation prediction.
In particular, we impose Calibration and Ordinal Ranking constraints on the variance vectors of audiovisual latent distributions.
no code implementations • 13 May 2022 • Björn W. Schuller, Anton Batliner, Shahin Amiriparian, Christian Bergler, Maurice Gerczuk, Natalie Holz, Pauline Larrouy-Maestri, Sebastian P. Bayerl, Korbinian Riedhammer, Adria Mallol-Ragolta, Maria Pateraki, Harry Coppock, Ivan Kiskin, Marianne Sinka, Stephen Roberts
The ACM Multimedia 2022 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: In the Vocalisations and Stuttering Sub-Challenges, a classification on human non-verbal vocalisations and speech has to be made; the Activity Sub-Challenge aims at beyond-audio human activity recognition from smartwatch sensor data; and in the Mosquitoes Sub-Challenge, mosquitoes need to be detected.
no code implementations • 10 May 2022 • Xiangheng He, Andreas Triantafyllopoulos, Alexander Kathan, Manuel Milling, Tianhao Yan, Srividya Tirunellai Rajamani, Ludwig Küster, Mathias Harrer, Elena Heber, Inga Grossmann, David D. Ebert, Björn W. Schuller
Previous studies have shown the correlation between sensor data collected from mobile phones and human depression states.
Stress is a major threat to well-being that manifests in a variety of physiological and mental symptoms.
no code implementations • 9 May 2022 • Andreas Triantafyllopoulos, Sandra Ottl, Alexander Gebhard, Esther Rituerto-González, Mirko Jaumann, Steffen Hüttner, Valerie Dieter, Patrick Schneeweiß, Inga Krauß, Maurice Gerczuk, Shahin Amiriparian, Björn W. Schuller
Although running is a common leisure activity and a core training regimen for several athletes, between 29% and 79% of runners sustain an overuse injury each year.
no code implementations • 6 May 2022 • Alexander Kathan, Andreas Triantafyllopoulos, Xiangheng He, Manuel Milling, Tianhao Yan, Srividya Tirunellai Rajamani, Ludwig Küster, Mathias Harrer, Elena Heber, Inga Grossmann, David D. Ebert, Björn W. Schuller
Digital health applications are becoming increasingly important for assessing and monitoring the wellbeing of people suffering from mental health conditions like depression.
Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio.
Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets.
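For illustration, frame-level embeddings from such a pre-trained transformer can be pooled into utterance-level features for downstream SER; the checkpoint and mean-pooling below are assumptions for the sketch, not the specific setup of any paper here.

```python
# Extract utterance-level embeddings from a pre-trained wav2vec 2.0 model.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)  # 1 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 768)

embedding = hidden.mean(dim=1)  # mean-pool frames into one utterance vector
print(embedding.shape)  # torch.Size([1, 768])
```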
Detecting COVID-19 from audio signals, such as breathing and coughing, can serve as a fast and efficient pre-testing method to reduce virus transmission.
Respiratory sound classification is an important tool for remote screening of respiratory-related diseases such as pneumonia, asthma, and COVID-19.
Emotional voice conversion (EVC) focuses on converting a speech utterance from a source to a target emotion; it can thus be a key enabling technology for human-computer interaction applications and beyond.
To clearly understand the performance differences between recurrent and attention models in audiovisual affect recognition, we present a comprehensive evaluation of fusion models based on LSTM-RNNs, self-attention and cross-modal attention, trained for valence and arousal estimation.
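The cross-modal attention pattern compared in such evaluations can be sketched in a few lines of PyTorch; the dimensions and single-layer setup are illustrative assumptions, not the evaluated architectures.

```python
# Cross-modal attention: one modality queries the other.
import torch
import torch.nn as nn

audio = torch.randn(8, 50, 64)   # (batch, audio frames, features)
video = torch.randn(8, 30, 64)   # (batch, video frames, features)

cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

# Audio attends to video: queries come from audio, keys/values from video.
fused, _ = cross_attn(query=audio, key=video, value=video)
print(fused.shape)  # torch.Size([8, 50, 64])
```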
Recent advances in transformer-based architectures that are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks.
Emotion and a broader range of affective driver states can be a life decisive factor on the road.
no code implementations • 10 Mar 2022 • Björn W. Schuller, Alican Akman, Yi Chang, Harry Coppock, Alexander Gebhard, Alexander Kathan, Esther Rituerto-González, Andreas Triantafyllopoulos, Florian B. Pokorny
We categorise potential computer audition applications according to the five elements of earth, water, air, fire, and aether, proposed by the ancient Greeks in their five element theory; this categorisation serves as a framework to discuss computer audition in relation to different ecological aspects.
The proposed framework consists of i) federated learning for data privacy, and ii) adversarial training at the training stage and randomisation at the testing stage for model robustness.
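The federated learning component can be illustrated with a minimal FedAvg-style round, in which raw client data never leaves the device and only model weights are aggregated; all names below are assumptions for the sketch, not the framework's actual code.

```python
# One FedAvg round: each client trains locally, then weights are averaged.
import copy
import torch

def federated_round(global_model, clients, local_train):
    """clients: iterable of per-client datasets; local_train: in-place trainer."""
    client_states = []
    for client_data in clients:
        local_model = copy.deepcopy(global_model)
        local_train(local_model, client_data)      # runs only on the client
        client_states.append(local_model.state_dict())

    # Average each parameter across clients; only weights are shared.
    avg_state = {
        name: torch.stack([s[name].float() for s in client_states]).mean(dim=0)
        for name in client_states[0]
    }
    global_model.load_state_dict(avg_state)
    return global_model
```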
3 code implementations • 6 Mar 2022 • Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk
The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios.
no code implementations • 17 Feb 2022 • Harry Coppock, Alican Akman, Christian Bergler, Maurice Gerczuk, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Jing Han, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Panagiotis Tzirakis, Anton Batliner, Cecilia Mascolo, Björn W. Schuller
The COVID-19 pandemic has caused massive humanitarian and economic damage.
We present a theoretical analysis of the method, in addition to an empirical comparison against two standard methods for fairness, namely data balancing and adversarial training.
As desired, the proposed network controls the fine-grained emotion intensity in the output speech.
To show the favourable properties of our pre-trained model on modelling facial affect, we use the RECOLA database, and compare with the current state-of-the-art approach.
In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER).
Although these factors play a well-understood role in the performance of ASC models, most works report single evaluation metrics taking into account all different strata of a particular dataset.
We report on cross-running the recent COVID-19 Identification ResNet (CIdeR) on the two Interspeech 2021 COVID-19 diagnosis from cough and speech audio challenges: ComParE and DiCOVA.
We utilise a Long Short-Term Memory Recurrent Neural Network to explore the benefit of fusing these physiological signals with arousal as the target, learning from various audio, video, and text-based features.
With this in mind, the MuSe-Toolbox provides the functionality to run exhaustive searches for meaningful class clusters in the continuous gold standards.
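In the same spirit, a search for class clusters over a continuous gold standard can be sketched as follows; k-means with a silhouette criterion is one plausible choice, not necessarily the toolbox's exact procedure, and the annotations are random placeholders.

```python
# Turn continuous annotations into discrete classes via a small cluster search.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

valence = np.random.uniform(-1, 1, size=(500, 1))  # continuous gold standard

# Search a small range of cluster counts and keep the most cohesive one.
best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(valence)
    score = silhouette_score(valence, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"selected {best_k} classes (silhouette={best_score:.3f})")
```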
Emotional Voice Conversion (EVC) aims to convert the emotional style of a source speech signal to a target style while preserving its content and speaker identity information.
Images can convey rich semantics and induce various emotions in viewers.
The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning.
To unfold the tremendous amount of multimedia data uploaded daily to social media platforms, effective topic modeling techniques are needed.
In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm.
By obtaining state-of-the-art results on a set of paralinguistics tasks, we demonstrate the suitability of the proposed transfer learning approach for embedded audio signal processing, even when data is scarce.
no code implementations • 20 Apr 2021 • Shahin Amiriparian, Artem Sokolov, Ilhan Aslan, Lukas Christ, Maurice Gerczuk, Tobias Hübner, Dmitry Lamanov, Manuel Milling, Sandra Ottl, Ilya Poduremennykh, Evgeniy Shuranov, Björn W. Schuller
Text encodings from automatic speech recognition (ASR) transcripts and audio representations have both shown promise in speech emotion recognition (SER).
Multimodal Sentiment Analysis (MuSe) 2021 is a challenge focusing on the tasks of sentiment and emotion, as well as physiological-emotion and emotion-based stress recognition through more comprehensively integrating the audio-visual, language, and biological signal modalities.
Recently, extensive research efforts have been dedicated to understanding the emotions of images.
In this paper, we propose a novel framework that can capture both the semantic and the paralinguistic information in the signal.
no code implementations • 24 Feb 2021 • Björn W. Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Maurice Gerczuk, Panagiotis Tzirakis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Leon J. M. Rothkrantz, Joeri Zwerts, Jelle Treep, Casper Kaandorp
The INTERSPEECH 2021 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: In the COVID-19 Cough and COVID-19 Speech Sub-Challenges, a binary classification on COVID-19 infection has to be made based on coughing sounds and speech; in the Escalation Sub-Challenge, a three-way assessment of the level of escalation in a dialogue is featured; and in the Primates Sub-Challenge, four primate species versus background need to be classified.
Our main contributions are as follows: (I) We demonstrate the first attempt to diagnose COVID-19 using end-to-end deep learning from a crowd-sourced dataset of audio samples, achieving a ROC-AUC of 0.846; (II) Our model, the COVID-19 Identification ResNet (CIdeR), has potential for rapid scalability, minimal cost, and improving performance as more data becomes available.
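For reference, the reported metric can be computed with scikit-learn; the labels and scores below are dummy placeholders, not the study's data.

```python
# ROC-AUC from binary ground truth and model probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]               # e.g. PCR-referenced COVID-19 status
y_score = [0.1, 0.4, 0.8, 0.6, 0.9, 0.3]  # model-predicted probabilities

print(f"ROC-AUC: {roc_auc_score(y_true, y_score):.3f}")
```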
In addition, the extended learning period is a general challenge for deep RL that can impact the speed of learning for SER.
The COVID-19 pandemic has affected the world unevenly; while industrial economies have been able to produce the tests necessary to track the spread of the virus and mostly avoided complete lockdowns, developing countries have faced issues with testing capacity.
no code implementations • 17 Dec 2020 • Katrin D. Bartl-Pokorny, Florian B. Pokorny, Anton Batliner, Shahin Amiriparian, Anastasia Semertzidou, Florian Eyben, Elena Kramer, Florian Schmidt, Rainer Schönweiler, Markus Wehler, Björn W. Schuller
Group differences in the front vowels /i:/ and /e:/ are additionally reflected in the variation of the fundamental frequency and the harmonics-to-noise ratio, and group differences in the back vowels /o:/ and /u:/ in statistics of the Mel-frequency cepstral coefficients and the spectral slope.
To the best of our knowledge, MeDaS is the first open-source platform providing a collaborative and interactive service that enables researchers from a medical background to easily use DL-related toolkits, and at the same time helps scientists and engineers from the information sciences understand the medical side.
Systems for the automatic recognition and detection of automotive parts are crucial in several emerging research areas in the development of intelligent vehicles.
Hence, with extensive experimental results, we demonstrate that, by harnessing the power of high-fidelity audio generation, the proposed GAAE model can learn powerful representations from unlabelled data while leveraging only a small percentage of labelled data as supervision/guidance.
In this mini-survey, we review all the literature we have found to date that attempts to solve packet loss in speech using deep learning methods.
Additionally, extending this with an end-to-end emotion prediction neural network provides a network that performs SER from audio with lost frames, end-to-end.
On the development partition of the data, we achieve Spearman's correlation coefficients of .324, .283, and .320 with the targets on the Karolinska Sleepiness Scale by utilising attention and non-attention autoencoders, and the fusion of both autoencoders' representations, respectively.
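For reference, this metric is straightforward to compute with SciPy; the values below are dummy placeholders, not the study's data.

```python
# Spearman's rank correlation between predictions and sleepiness targets.
from scipy.stats import spearmanr

predictions = [3.1, 5.4, 2.2, 7.8, 6.0]
kss_targets = [3, 6, 2, 8, 5]  # Karolinska Sleepiness Scale ratings

rho, p_value = spearmanr(predictions, kss_targets)
print(f"Spearman's rho: {rho:.3f} (p={p_value:.3f})")
```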
To the best of our knowledge, it is the first public toolkit assembling a series of state-of-the-art deep learning technologies.
no code implementations • 30 Apr 2020 • Jing Han, Kun Qian, Meishu Song, Zijiang Yang, Zhao Ren, Shuo Liu, Juan Liu, Huaiyuan Zheng, Wei Ji, Tomoya Koike, Xiao Li, Zixing Zhang, Yoshiharu Yamamoto, Björn W. Schuller
In particular, by analysing speech recordings from these patients, we construct audio-only-based models to automatically categorise the health state of patients from four aspects, including the severity of illness, sleep quality, fatigue, and anxiety.
1 code implementation • 30 Apr 2020 • Lukas Stappen, Alice Baird, Georgios Rizos, Panagiotis Tzirakis, Xinchen Du, Felix Hafner, Lea Schumann, Adria Mallol-Ragolta, Björn W. Schuller, Iulia Lefter, Erik Cambria, Ioannis Kompatsiaris
Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based Workshop focusing on the tasks of sentiment recognition, as well as emotion-target engagement and trustworthiness detection by means of more comprehensively integrating the audio-visual and language modalities.
We come to the conclusion that computer audition (CA) appears ready for the implementation of (pre-)diagnosis and monitoring tools, and more generally provides rich and significant, yet so far untapped, potential in the fight against the spread of COVID-19.
Research on speech processing has traditionally considered the task of designing hand-engineered acoustic features (feature engineering) as a separate distinct problem from the task of designing efficient machine learning (ML) models to make prediction and classification decisions.
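The feature-engineering half of this dichotomy can be sketched with hand-crafted MFCC functionals computed before any learning takes place; librosa and the chosen statistics are illustrative assumptions, and the file name is a placeholder.

```python
# Classic feature engineering: hand-crafted MFCC functionals for a separate
# ML model, as opposed to learning features end-to-end from the waveform.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, frames)

# Fixed-length functional representation for a conventional classifier.
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(features.shape)  # (26,)
```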
Deep reinforcement learning (deep RL) combines deep learning with reinforcement learning principles to create efficient methods that learn by interacting with their environment.
We introduce openXBOW, an open-source toolkit for the generation of bag-of-words (BoW) representations from multimodal input.
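Although openXBOW itself is implemented in Java, the underlying bag-of-words idea can be sketched in Python: quantise frame-level features against a learned codebook and count codeword occurrences. The data, codebook size, and normalisation below are placeholder assumptions, not the toolkit's defaults.

```python
# Bag-of-audio-words: codebook via k-means, then per-segment histograms.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(1000, 26)  # frame-level acoustic features (dummy)
codebook = KMeans(n_clusters=50, n_init=10, random_state=0).fit(frames)

def bag_of_words(segment_frames, codebook):
    """Normalised histogram of codeword assignments for one segment."""
    assignments = codebook.predict(segment_frames)
    counts = np.bincount(assignments, minlength=codebook.n_clusters)
    return counts / counts.sum()

print(bag_of_words(frames[:100], codebook).shape)  # (50,)
```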