For the HRTF data, we use truncated spherical harmonic (SH) coefficients to represent the HRTF magnitudes and onsets.
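As a minimal sketch of this representation (assuming HRTF magnitudes sampled on a spherical measurement grid; the grid, SH order, and data below are placeholders, not the paper's setup), truncated SH coefficients can be fit by least squares:

```python
# Fit truncated spherical-harmonic (SH) coefficients to HRTF magnitudes
# sampled on a spherical grid. Grid and data are illustrative placeholders.
import numpy as np
from scipy.special import sph_harm

def sh_basis(order, azimuth, colatitude):
    """Complex SH basis matrix of shape (num_points, (order + 1) ** 2)."""
    cols = [sph_harm(m, n, azimuth, colatitude)
            for n in range(order + 1)
            for m in range(-n, n + 1)]
    return np.stack(cols, axis=1)

azimuth = np.random.uniform(0, 2 * np.pi, 500)    # placeholder grid
colatitude = np.random.uniform(0, np.pi, 500)
hrtf_mag = np.random.rand(500)                    # magnitudes at one frequency bin

Y = sh_basis(8, azimuth, colatitude)              # truncation at order 8
coeffs, *_ = np.linalg.lstsq(Y, hrtf_mag, rcond=None)   # truncated SH coefficients
reconstruction = (Y @ coeffs).real                # magnitudes re-synthesized on the grid
```

The same fit applies per frequency bin for the magnitudes and once for the onset map.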
This clarified definition is motivated by our extensive experiments, through which we find that existing ASD methods fail to model audio-visual synchronization and often classify unsynchronized videos as active speaking.
Fully-supervised models for source separation are trained on parallel mixture-source data and are currently state-of-the-art.
This naturally enforces playability constraints for guitar, and yields tablature that is more consistent with the symbolic data used to estimate pairwise likelihoods.
When formulating piano transcription in this way, we eliminate the need to rely on disjoint frame-level estimates for different stages of a note event.
Inferring musical time structures has a broad range of applications in music production, processing, and analysis.
In this paper, we conduct a cross-dataset study of parametric and non-parametric raw-waveform-based speaker embeddings through speaker verification experiments.
In this work, several variations of a frontend filterbank learning module are investigated for piano transcription, a challenging low-level music information retrieval task.
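As one concrete variant (a minimal sketch, not the paper's exact configuration: filter count, kernel length, and hop are illustrative assumptions), a frontend filterbank can be an unconstrained 1-D convolution over the raw waveform, learned jointly with the transcription model:

```python
# Learnable filterbank frontend: a 1-D convolution over raw audio whose
# kernels act as learned band-pass filters. All sizes are assumptions.
import torch
import torch.nn as nn

class LearnedFilterbank(nn.Module):
    def __init__(self, n_filters=128, kernel_size=512, hop=160):
        super().__init__()
        # Each output channel is one learnable filter applied to the waveform.
        self.conv = nn.Conv1d(1, n_filters, kernel_size,
                              stride=hop, padding=kernel_size // 2, bias=False)

    def forward(self, waveform):          # (batch, samples)
        x = waveform.unsqueeze(1)         # (batch, 1, samples)
        x = self.conv(x)                  # (batch, n_filters, frames)
        return torch.log1p(x.abs())       # compressed magnitude, spectrogram-like

frontend = LearnedFilterbank()
features = frontend(torch.randn(4, 16000))   # e.g. 1 s of 16 kHz audio per item
```

Other variants constrain the kernels (e.g., parameterizing them as sinc or Gaussian filters) rather than learning them freely.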
The online estimation of rhythmic information, such as beat positions, downbeat positions, and meter, is critical for many real-time music applications.
Different from previous ASVspoof challenges, this year's LA task presents codec and transmission-channel variability, while the new DF task presents general audio compression.
Separating a song into vocal and accompaniment components is an active research topic, and recent years have witnessed increased performance from supervised training using deep learning techniques.
Spoofing countermeasure (CM) systems are critical in speaker verification; they aim to discern spoofing attacks from bona fide speech trials.
An interaction reward model is trained on the duets formed from outer parts of Bach chorales to model counterpoint interaction, while a style reward model is trained on monophonic melodies of Chinese folk songs to model melodic patterns.
Most existing OBT methods either apply offline approaches to a moving window of past data to predict upcoming beat positions, or must be primed with past data at startup to initialize.
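A minimal sketch of the moving-window strategy (using librosa's offline tracker as a stand-in for the offline approach; buffer and block sizes are assumptions):

```python
# Keep a buffer of recent audio and rerun an offline beat tracker on it
# for every incoming block. Buffer/block sizes are illustrative.
import numpy as np
import librosa

SR, WINDOW_S = 22050, 6.0
buffer = np.zeros(int(SR * WINDOW_S), dtype=np.float32)

def process_block(block):
    """Slide the window forward and re-estimate beats offline over it."""
    global buffer
    buffer = np.concatenate([buffer[len(block):], block])
    tempo, beat_frames = librosa.beat.beat_track(y=buffer, sr=SR)
    beat_times = librosa.frames_to_time(beat_frames, sr=SR)
    # A real-time system would emit only beats near the window's trailing
    # edge as "current" predictions and extrapolate the next beat from tempo.
    return tempo, beat_times
```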
Human voices can be used to authenticate the identity of a speaker, but automatic speaker verification (ASV) systems are vulnerable to voice spoofing attacks, such as impersonation, replay, text-to-speech, and voice conversion.
State-of-the-art text-independent speaker verification systems typically use cepstral features or filter bank energies as speech features.
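Both feature types can be extracted in a few lines (a sketch with librosa; the frame sizes and feature dimensions are common defaults, not a specific system's):

```python
# Log filter-bank energies and cepstral features (MFCCs) from a waveform.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex('trumpet'))   # any mono signal works here

# Log filter-bank energies: log of a mel spectrogram
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                     n_fft=400, hop_length=160)
fbank = np.log(mel + 1e-10)                   # shape: (40, frames)

# Cepstral features: MFCCs derived from the same mel filter bank
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                            n_fft=400, hop_length=160)   # shape: (20, frames)
```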
Applications of short-form user-generated video (UGV), such as Snapchat and short-form videos on YouTube, have boomed recently, giving rise to many multimodal machine learning tasks.
Visual emotion expression plays an important role in audiovisual speech communication.
We cast this as a reinforcement learning problem, where the generation agent learns a policy to generate a musical note (action) based on previously generated context (state).
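A minimal sketch of this framing (the tiny policy network and the stand-in reward below are illustrative assumptions, not the paper's models) trains the policy with REINFORCE:

```python
# State = previously generated note context; action = next note.
# `reward_model` is a placeholder for a learned reward, e.g. the
# interaction/style reward models described above.
import torch
import torch.nn as nn

N_PITCHES, CTX = 128, 16
policy = nn.Sequential(nn.Embedding(N_PITCHES, 64), nn.Flatten(),
                       nn.Linear(64 * CTX, N_PITCHES))
optim = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reward_model(sequence):                      # hypothetical stand-in
    return torch.rand(())

state = torch.zeros(1, CTX, dtype=torch.long)    # empty context to start
log_probs, sequence = [], []
for _ in range(32):                              # generate 32 notes
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()                       # sample the next note (action)
    log_probs.append(dist.log_prob(action))
    sequence.append(int(action))
    state = torch.cat([state[:, 1:], action.view(1, 1)], dim=1)

loss = -torch.stack(log_probs).sum() * reward_model(sequence)   # REINFORCE
optim.zero_grad(); loss.backward(); optim.step()
```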
We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions.
In this paper, we consider the following task: given an arbitrary speech audio clip and a single lip image of an arbitrary target identity, generate synthesized lip movements of the target identity saying that speech.
In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos.
As the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances on different instruments.