Upsampling artifacts are caused by problematic upsampling layers and by spectral replicas that emerge during upsampling.
Version identification (VI) systems now offer accurate and scalable solutions for detecting different renditions of a musical composition, allowing the use of these systems in industrial applications and throughout the wider music ecosystem.
The setlist identification (SLI) task addresses a music recognition use case where the goal is to retrieve the metadata and timestamps for all the tracks played in live music events.
We then compare different upsampling layers, showing that nearest neighbor upsamplers can be an alternative to the problematic (but state-of-the-art) transposed and subpixel convolutions, which are prone to introducing tonal artifacts.
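The contrast between upsampler types can be illustrated with a minimal NumPy sketch (an illustrative toy, not the paper's models): transposed convolutions internally zero-stuff the signal, which creates mirror-image spectral replicas of the baseband content, whereas nearest-neighbor repetition attenuates those replicas with a sinc-shaped lobe.

```python
import numpy as np

def nn_upsample(x, factor=2):
    # Nearest-neighbor upsampling: repeat each sample `factor` times.
    return np.repeat(x, factor)

def zero_stuff_upsample(x, factor=2):
    # Zero insertion, as done inside transposed convolutions: interleave
    # each sample with `factor - 1` zeros. The spectrum of the result
    # contains exact replicas that a learned filter must then attenuate.
    y = np.zeros(len(x) * factor)
    y[::factor] = x
    return y

# Bin-aligned test tone: 5 cycles over 64 samples.
x = np.sin(2 * np.pi * 5 * np.arange(64) / 64)

spec_nn = np.abs(np.fft.rfft(nn_upsample(x)))
spec_zs = np.abs(np.fft.rfft(zero_stuff_upsample(x)))

# After 2x upsampling to 128 samples, the tone sits at bin 5 and its
# replica at bin 59 (= 64 - 5). Zero stuffing leaves the replica at full
# strength; nearest-neighbor repetition strongly attenuates it.
```

A learned transposed convolution must suppress that full-strength replica on its own, which is one reason it can leave tonal artifacts behind.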
Applications of deep learning to automatic multitrack mixing are largely unexplored.
Version identification systems aim to detect different renditions of the same underlying musical composition (loosely called cover songs).
Automatic speech quality assessment is an important, transversal task whose progress is hampered by the scarcity of human annotations, poor generalization to unseen recording conditions, and a lack of flexibility of existing approaches.
The version identification (VI) task deals with the automatic detection of recordings that correspond to the same underlying musical piece.
Likelihood-based generative models are a promising resource to detect out-of-distribution (OOD) inputs which could compromise the robustness or reliability of a machine learning system.
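The basic idea can be sketched in a few lines of NumPy, with a diagonal Gaussian density standing in for a deep likelihood model (an illustrative assumption, not the models studied in the paper): score inputs by their log-likelihood under the in-distribution model and flag those that fall below a low quantile of the training scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-distribution training data; a diagonal Gaussian stands in for a
# deep likelihood-based generative model (illustrative assumption).
train = rng.normal(loc=0.0, scale=1.0, size=(1000, 8))
mu, sigma = train.mean(axis=0), train.std(axis=0)

def log_likelihood(x):
    # Sum of per-dimension Gaussian log-densities.
    z = (x - mu) / sigma
    return np.sum(-0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi), axis=-1)

in_dist = rng.normal(0.0, 1.0, size=(100, 8))
out_dist = rng.normal(5.0, 1.0, size=(100, 8))  # shifted distribution

# Flag inputs whose likelihood falls below the 1% quantile of training scores.
threshold = np.quantile(log_likelihood(train), 0.01)
flags_in = log_likelihood(in_dist) < threshold
flags_out = log_likelihood(out_dist) < threshold
```

With deep models the raw likelihood score is known to behave less cleanly than in this toy setting, which is precisely what motivates studying when and why it can compromise robustness.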
End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations.
Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure.
Ranked #2 on Distant Speech Recognition on DIRHA English WSJ
The speech enhancement task usually consists of removing additive noise or reverberation that partially mask spoken utterances, affecting their intelligibility.
We investigate supervised learning strategies that improve the training of neural network audio classifiers on small annotated collections.
Most methods of voice restoration for patients suffering from aphonia either produce whispered or monotone speech.
The conversion from text to speech relies on an accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs statistical models such as recurrent neural networks.
no code implementations • 7 Jun 2018 • Emilia Gómez, Carlos Castillo, Vicky Charisi, Verónica Dahl, Gustavo Deco, Blagoj Delipetrev, Nicole Dewandre, Miguel Ángel González-Ballester, Fabien Gouyon, José Hernández-Orallo, Perfecto Herrera, Anders Jonsson, Ansgar Koene, Martha Larson, Ramón López de Mántaras, Bertin Martens, Marius Miron, Rubén Moreno-Bote, Nuria Oliver, Antonio Puertas Gallardo, Heike Schweitzer, Nuria Sebastian, Xavier Serra, Joan Serrà, Songül Tolan, Karina Vold
The workshop gathered an interdisciplinary group of experts to establish the state-of-the-art research in the field and a list of future research challenges on the topics of human and machine intelligence, algorithms' potential impact on human cognitive capabilities and decision making, and evaluation and regulation needs.
We evaluate the performance of the proposed approach on a well-known time series classification benchmark, considering full adaptation, partial adaptation, and no adaptation of the encoder to the new data type.
In this paper, we propose a task-based hard attention mechanism that preserves previous tasks' information without affecting the current task's learning.
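One way to picture such a mechanism, sketched here in NumPy under illustrative assumptions (the embedding values and the steepness constant are hypothetical, not taken from the paper): each task learns an embedding per unit, a steep sigmoid turns it into a near-binary attention mask over the layer, and gradients toward units claimed by previous tasks are scaled down by the cumulative mask.

```python
import numpy as np

def task_mask(task_embedding, s=50.0):
    # Gate each unit with a steep sigmoid of a per-task embedding; a
    # large s pushes the gate toward a near-binary (hard) attention mask.
    return 1.0 / (1.0 + np.exp(-s * task_embedding))

# Hypothetical learned embeddings for two tasks over a 6-unit layer.
e_task0 = np.array([ 2.0,  1.5, -2.0, -1.0, -3.0,  0.5])
e_task1 = np.array([-2.0, -1.0,  2.5,  1.0, -2.0,  2.0])

m0, m1 = task_mask(e_task0), task_mask(e_task1)

h = np.ones(6)        # layer activations (toy values)
h_task0 = h * m0      # forward pass conditioned on task 0

# While learning task 1, gradients flowing into units that task 0
# claimed are attenuated by (1 - cumulative mask), preserving the
# previous task's information without blocking the current task.
cum_mask = m0                     # cumulative mask after task 0
grad = np.ones(6)                 # toy incoming gradient
grad_task1 = grad * (1.0 - cum_mask)
```

Units with a near-zero cumulative mask remain fully plastic for the new task, which is what lets the current task keep learning.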
Besides using classical gradient-boosted trees, we demonstrate how to make continual predictions using a recurrent neural network (RNN).
In this work, we present the results of adapting a speech enhancement generative adversarial network by finetuning the generator with small amounts of data.
Due to the structure of the data coming from recommendation domains (i.e., one-hot-encoded vectors of item preferences), these algorithms tend to have large input and output dimensionalities that dominate their overall size.
We present a practical approach for processing mobile sensor time series data for continual deep learning predictions.
Our results indicate that, compared to the best baseline, tree-based models can deliver up to 14% better forecasts for regular hot spots and 153% better forecasts for non-regular hot spots.
In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them.
Finding repeated patterns, or motifs, in a time series is an important unsupervised task that still has a number of open issues, starting with the very definition of a motif.
Specifically, we find that length-normalized motif dissimilarities still have intrinsic dependencies on the motif length, and that the lowest dissimilarities are particularly affected by this dependency.
In this article, we propose an innovative standpoint and present a solution coming from it: an anytime multimodal optimization algorithm for time series motif discovery based on particle swarms.
In particular, the similarity measure is an essential ingredient of time series clustering and classification systems.
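A brute-force baseline for the motif-discovery task described above can be sketched as follows (this is an illustrative baseline, not the anytime particle-swarm algorithm itself); it uses the common choice of a length-normalized, z-normalized Euclidean dissimilarity between non-overlapping subsequences:

```python
import numpy as np

def znorm(x):
    # z-normalize a subsequence so matching is offset- and scale-invariant.
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def find_motif(series, m):
    # Brute-force top motif: the pair of non-overlapping length-m
    # subsequences with the lowest length-normalized dissimilarity.
    n = len(series) - m + 1
    best_d, best_i, best_j = np.inf, -1, -1
    for i in range(n):
        a = znorm(series[i:i + m])
        for j in range(i + m, n):  # enforce non-overlapping occurrences
            b = znorm(series[j:j + m])
            d = np.linalg.norm(a - b) / np.sqrt(m)  # length-normalized
            if d < best_d:
                best_d, best_i, best_j = d, i, j
    return best_d, best_i, best_j

rng = np.random.default_rng(1)
series = rng.normal(size=200)
pattern = 10.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, 20))
series[30:50] += pattern    # plant the same pattern twice
series[140:160] += pattern

d, i, j = find_motif(series, 20)
```

The quadratic cost of this exhaustive pairwise search is exactly what motivates anytime approaches: they can return a good candidate pair at any point while continuing to refine it.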