Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary.
Accent plays a significant role in speech communication: it influences how well speech is understood and also conveys aspects of a speaker's identity.
In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT).
Our results show that the model outperforms existing state-of-the-art models when forecasting extreme volatility spikes for Bitcoin using CryptoQuant data as well as whale-alert tweets.
However, its novelty necessitates a new perspective on how to evaluate such a model.
In our hybrid model, we use sentence-level FinBERT embeddings, pretrained on financial lexicons, to capture the full content of the tweets and feed it to the model in a form it can exploit.
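A minimal sketch of extracting such sentence-level embeddings with the Hugging Face transformers library; the "ProsusAI/finbert" checkpoint and the mean pooling are assumptions for illustration, not necessarily the paper's exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModel.from_pretrained("ProsusAI/finbert")

def embed(sentences):
    # Tokenize the tweets and mean-pool the final hidden states.
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)     # (B, 768)

vectors = embed(["Whale Alert: 10,000 BTC moved to an exchange."])
```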
2 code implementations • 6 Mar 2022 • Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk
The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios.
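For context, HEAR submissions are consumed through a small common interface; the sketch below follows the published three-function API (load_model, get_scene_embeddings, get_timestamp_embeddings), with the module name as a placeholder for any compliant implementation.

```python
import torch
import my_hear_module as hear  # placeholder: any module implementing the API

model = hear.load_model("weights.pt")
audio = torch.zeros(4, model.sample_rate)  # four one-second mono clips

# One embedding per clip, for scene-level tasks such as tagging.
scene = hear.get_scene_embeddings(audio, model)

# Frame-level embeddings plus their timestamps, for sound event detection.
frames, timestamps = hear.get_timestamp_embeddings(audio, model)
```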
In this paper we present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis to study how the auditory and visual modalities contribute to the perceived emotion of media.
We present a novel music generation framework for music infilling, with a user-friendly interface.
The field of automatic music composition has seen great progress in recent years, specifically with the invention of transformer-based architectures.
The models that use all visual, audio, and text features simultaneously as their inputs performed better than those using features extracted from each modality separately.
Most of the current supervised automatic music transcription (AMT) models lack the ability to generalize.
This provides a unique and integrated approach that guides managers and lead developers through the various challenges in the implementation process.
When trained to classify coughs into two classes -- healthy or pathological (in general or belonging to a specific respiratory pathology) -- the resulting model reaches an accuracy exceeding 84% relative to the labels provided by the physicians' diagnoses.
In this paper, we present a novel approach for calculating the valence (the positivity or negativity of the perceived emotion) of a chord progression within a lead sheet, using pre-defined mood tags proposed by music experts.
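As a purely hypothetical illustration of the idea (the table values and names below are invented, not the paper's), valence can be read as a per-chord score derived from expert mood tags and aggregated over the progression:

```python
# Illustrative chord-quality valence scores derived from mood tags.
CHORD_VALENCE = {"maj": 0.8, "min": -0.6, "dim": -0.9, "aug": -0.3, "dom7": 0.2}

def progression_valence(chords):
    # Average the per-chord scores over the whole progression.
    scores = [CHORD_VALENCE[c] for c in chords]
    return sum(scores) / len(scores)

print(progression_valence(["maj", "maj", "min", "dom7"]))  # approximately 0.3
```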
In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet.
We attempt to use only the pitch labels (together with spectrogram reconstruction loss) and explore how far this model can go without introducing supervised sub-tasks.
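A minimal sketch of such a training objective, with a stand-in model that outputs both a piano-roll prediction and a spectrogram reconstruction; sizes and the loss weighting are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in model: predicts an 88-pitch roll and reconstructs the input.
class Stub(nn.Module):
    def __init__(self, bins=229):
        super().__init__()
        self.to_roll = nn.Linear(bins, 88)
        self.to_spec = nn.Linear(88, bins)
    def forward(self, spec):
        roll_logits = self.to_roll(spec)
        return roll_logits, self.to_spec(torch.sigmoid(roll_logits))

model = Stub()
spec = torch.randn(2, 100, 229)                  # (batch, frames, bins)
labels = torch.randint(0, 2, (2, 100, 88)).float()

roll_logits, spec_hat = model(spec)
pitch_loss = F.binary_cross_entropy_with_logits(roll_logits, labels)
recon_loss = F.mse_loss(spec_hat, spec)          # needs only the audio itself
loss = pitch_loss + recon_loss                   # weighting is illustrative
```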
Many of the music generation systems based on neural networks are fully autonomous and do not offer control over the generation process.
We use this new dataset to train different classification models to distinguish the origin of the music in terms of these ethnic groups.
Using arousal as an example of a high-level feature, we show that the "faders" of our model are disentangled and change linearly w.r.t.
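An illustrative sketch of the fader mechanism, using stand-in linear layers in place of the trained VAE encoder and decoder; the dimension index is an assumption:

```python
import torch
import torch.nn as nn

# Stand-in modules; in the actual model these are the trained VAE parts.
encoder = nn.Linear(128, 16)
decoder = nn.Linear(16, 128)

with torch.no_grad():
    z = encoder(torch.randn(1, 128))    # latent code of an input piece
    AROUSAL_DIM = 0                     # assumed disentangled dimension
    for level in torch.linspace(-2.0, 2.0, 5):
        z_edit = z.clone()
        z_edit[:, AROUSAL_DIM] = level  # slide the "fader" linearly
        output = decoder(z_edit)        # the feature should track the slider
```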
Information on liquid jet stream flow is crucial in many real-world applications.
We present a controllable neural audio synthesizer based on Gaussian Mixture Variational Autoencoders (GM-VAE), which can generate realistic piano performances in the audio domain that closely follow the temporal conditions of two essential style features of piano performance: articulation and dynamics.
This paper thoroughly analyses the effect of different input representations on polyphonic multi-instrument music transcription.
First, it takes a lot of hard disk space to store different frequency-domain representations.
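A small sketch of the trade-off, using librosa purely for illustration: the representations below can either be cached to disk ahead of time or recomputed per batch inside the data loader.

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))

# Three common frequency-domain views of the same waveform.
stft = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
cqt = np.abs(librosa.cqt(y, sr=sr))

# Caching all of these for a large corpus multiplies disk usage; computing
# them per batch in the data loader trades disk space for compute time.
```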
We propose a flexible framework that handles both singer conversion and conversion of a singer's vocal technique.
We present a Python library, called Midi Miner, that can calculate tonal tension and classify different tracks.
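A hypothetical usage sketch: the public repository exposes tension calculation and track classification as scripts, so the module-style names below are placeholders rather than Midi Miner's confirmed API.

```python
# Placeholder imports and function names, assumed for illustration only.
from midi_miner import tension_calculation, track_separate

tension = tension_calculation.compute("song.mid")  # per-bar tonal tension
roles = track_separate.classify("song.mid")        # e.g. melody / bass / drums
```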
When the training data is reduced to the train set alone, our method results in 309 confusions on the multi-target speaker identification task, 46% better than the baseline model.
Interestingly, we also observe that optical flow is more informative than RGB frames in videos, and overall, models using audio features are more accurate than those based on video features when making the final prediction of evoked emotions.
The proposed method comprises an ML-based feature extraction method and a classification technique.
Specifically, we use two separate encoders to learn distinct latent spaces for timbre and pitch, which form Gaussian mixture components representing instrument identity and pitch, respectively.
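A minimal PyTorch sketch of the dual-encoder layout; layer sizes and the per-class Gaussian means are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, in_dim=128, z_dim=16, n_instruments=10, n_pitches=88):
        super().__init__()
        self.timbre_enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                        nn.Linear(64, 2 * z_dim))
        self.pitch_enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                       nn.Linear(64, 2 * z_dim))
        # One Gaussian mean per instrument identity and per pitch.
        self.timbre_mu = nn.Parameter(torch.randn(n_instruments, z_dim))
        self.pitch_mu = nn.Parameter(torch.randn(n_pitches, z_dim))

    def forward(self, x):
        t_mu, t_logvar = self.timbre_enc(x).chunk(2, dim=-1)
        p_mu, p_logvar = self.pitch_enc(x).chunk(2, dim=-1)
        # Reparameterized samples from each latent space.
        z_t = t_mu + torch.randn_like(t_mu) * (0.5 * t_logvar).exp()
        z_p = p_mu + torch.randn_like(p_mu) * (0.5 * p_logvar).exp()
        return z_t, z_p  # separate timbre and pitch latents

z_timbre, z_pitch = DualEncoder()(torch.randn(8, 128))
```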
Finally, we evaluate the performance of our robust replay speaker detection system with a wide variety and different combinations of both extracted and machine-learned audio features on the 'out in the wild' ASVspoof 2017 dataset.
MorpheuS' novel framework has the ability to generate polyphonic pieces with a given tension profile and long- and short-term repeated pattern structures.
We present a unique neural network approach inspired by a technique that has revolutionized the field of vision: pixel-wise image classification, which we combine with cross-entropy loss and pretraining of the CNN as an autoencoder on singing voice spectrograms.
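A sketch of the two-stage recipe under assumed layer sizes: pretrain the CNN as an autoencoder on spectrograms, then reuse its encoder for per-pixel cross-entropy classification.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(16, 1, 3, padding=1)       # stage 1: reconstruct input
classifier = nn.Conv2d(16, 2, 1)               # stage 2: per-pixel logits

spec = torch.randn(8, 1, 128, 64)              # batch of spectrogram patches

# Stage 1: autoencoder pretraining on singing voice spectrograms.
recon_loss = nn.functional.mse_loss(decoder(encoder(spec)), spec)

# Stage 2: pixel-wise classification (e.g. voice vs. accompaniment).
labels = torch.randint(0, 2, (8, 128, 64))
ce_loss = nn.functional.cross_entropy(classifier(encoder(spec)), labels)
```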
In this newly learned vector space, a metric based on cosine distance is able to distinguish between functional chord relationships, as well as harmonic associations in the music.
A visualization of the reduced vector space using t-distributed stochastic neighbor embedding shows that the resulting embedded vector space captures tonal relationships, even without any explicit information about the musical contents of the slices.
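Both analyses can be sketched in a few lines; the random embedding matrix below merely stands in for the trained chord vectors.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(24, 64))     # stand-in: 24 chords, 64-d vectors

def cosine(u, v):
    # Cosine similarity between two chord embeddings.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb[0], emb[7]))       # e.g. comparing two related chords

# Reduce to 2-D for visual inspection of tonal relationships.
points = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(emb)
```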