How a host's language influences their pets' vocalizations is an interesting yet underexplored problem.
This study presents a data-driven investigation into the semantics of dog vocalizations by correlating different sound types with consistent semantics.
To tackle these challenges, we present an innovative, automatic audio caption generation pipeline built on a series of public tools and APIs, and construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs.
Automated audio captioning (AAC) is an important cross-modal translation task that aims to generate natural-language descriptions for audio clips.
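To make the task concrete, here is a minimal, purely illustrative sketch of the AAC input/output contract: a clip-level feature vector goes in, a caption comes out. This toy version assigns each query the caption of its nearest neighbour in a made-up reference set; real AAC systems use learned encoder-decoder models, and all features and captions below are invented placeholders, not from any paper above.

```python
import math

# Invented placeholder embeddings and captions for three reference clips.
REFERENCE_FEATURES = [
    [0.9, 0.1, 0.0],  # e.g. embedding of a barking clip
    [0.0, 0.8, 0.2],  # e.g. embedding of a rain clip
    [0.1, 0.1, 0.9],  # e.g. embedding of a siren clip
]
REFERENCE_CAPTIONS = [
    "a dog is barking",
    "rain is falling steadily",
    "a siren wails in the distance",
]

def caption_audio(feature_vector):
    """Return the caption of the closest reference clip (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best = min(range(len(REFERENCE_FEATURES)),
               key=lambda i: dist(REFERENCE_FEATURES[i], feature_vector))
    return REFERENCE_CAPTIONS[best]

print(caption_audio([0.85, 0.15, 0.05]))  # -> "a dog is barking"
```

The design choice to show retrieval rather than generation is deliberate: it isolates the cross-modal mapping (audio features to text) without the machinery of a trained decoder.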
Empowering chatbots in the field of mental health is receiving increasing attention, yet the development and evaluation of chatbots in psychiatric outpatient scenarios remain largely unexplored.
Controlling chatbot utterance generation with multiple attributes, such as personality, emotion, and dialogue act, is a practically useful but under-studied problem.
The second phase is to fine-tune the pretrained model on task-oriented dialogue (TOD) data.
In this work, we aim to build a unified dialogue foundation model (DFM) that can be used to solve a wide variety of dialogue tasks.
In a depression-diagnosis-directed clinical session, doctors initiate the conversation with ample emotional support, guiding patients to disclose their symptoms in line with clinical diagnostic criteria.
Mental disease detection (MDD) from social media has suffered from poor generalizability and interpretability, due to a lack of symptom modeling.
Depression is a prominent global health challenge, and early risk detection (ERD) of depression from online posts is a promising technique for combating the threat.
Automatic depression detection has attracted increasing attention but remains a challenging task.
Current metrics are found to correlate poorly with human annotations on these datasets.
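The claim that a metric correlates poorly with human judgement is typically quantified with a rank correlation between per-example metric scores and human ratings. The sketch below implements Spearman's rank correlation from scratch (no ties handled) on invented placeholder scores; it is an illustration of the evaluation protocol, not any specific study's analysis.

```python
def spearman(xs, ys):
    """Spearman rank correlation of two equal-length score lists (no ties)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Invented example: metric scores vs. human ratings for five captions.
metric_scores = [0.31, 0.72, 0.45, 0.90, 0.12]
human_ratings = [2.0, 3.5, 4.0, 3.0, 1.0]
print(spearman(metric_scores, human_ratings))  # close to 0 or negative => poor agreement
```

A correlation near 1 means the metric ranks outputs as humans do; values near 0 support a "poor correlation" finding.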
This report proposes an audio captioning system for Task 6 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge.
A dual learning approach is also proposed for the utterance rewrite model to address the data sparsity problem.
Our model outperforms other approaches on the DCASE2018 and URBAN-SED datasets without requiring prior duration knowledge.
Visually localizing multiple sound sources in unconstrained videos is a formidable problem, especially in the absence of pairwise sound-object annotations.
This paper proposes a method to disentangle and quantify interactions among words that are encoded inside a DNN for natural language processing.
We propose two GPVAD models: a full model (GPV-F) trained on 527 AudioSet sound events, and a binary model (GPV-B) that only distinguishes speech from noise.
Captioning has attracted much attention in image and video understanding, while only a small amount of work examines audio captioning.