1 code implementation • 28 Aug 2024 • You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, Zhiyao Duan
With the advancements in singing voice generation and the growing presence of AI singers on media platforms, the inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voices from authentic singers.
1 code implementation • 17 Aug 2024 • Samuele Cornell, Jordan Darefsky, Zhiyao Duan, Shinji Watanabe
In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis.
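A minimal sketch of how such a pipeline might be wired together is shown below; the `generate_dialogue` and `synthesize_turn` helpers are hypothetical placeholders for an LLM and a multi-speaker TTS model, not the authors' implementation.

```python
# Minimal sketch of an LLM -> multi-speaker TTS pipeline for generating
# synthetic conversational ASR training data. `generate_dialogue` and
# `synthesize_turn` are hypothetical placeholders, not real APIs.
import numpy as np

def generate_dialogue(topic: str, n_turns: int) -> list:
    """Placeholder for an LLM call that writes a multi-speaker dialogue."""
    return [{"speaker": "A" if i % 2 == 0 else "B",
             "text": f"Turn {i} about {topic}."} for i in range(n_turns)]

def synthesize_turn(text: str, speaker: str, sr: int = 16000) -> np.ndarray:
    """Placeholder for a multi-speaker TTS call returning a waveform."""
    return np.zeros(sr)  # one second of silence stands in for synthesized speech

def build_sample(topic: str, n_turns: int = 6, sr: int = 16000):
    turns = generate_dialogue(topic, n_turns)
    audio, segments, t0 = [], [], 0.0
    for turn in turns:
        wav = synthesize_turn(turn["text"], turn["speaker"], sr)
        segments.append({"speaker": turn["speaker"], "text": turn["text"],
                         "start": t0, "end": t0 + len(wav) / sr})
        audio.append(wav)
        t0 += len(wav) / sr
    return np.concatenate(audio), segments  # mixture waveform + reference transcript

mixture, transcript = build_sample("weekend plans")
```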
1 code implementation • 20 Jun 2024 • Kyungbok Lee, You Zhang, Zhiyao Duan
Additionally, to ensure the credibility of detection methods, it is beneficial for the model to identify which cues in the video indicate that it is fake.
1 code implementation • 15 Jun 2024 • Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan
We open-source the dataset and TTS models.
1 code implementation • 4 Jun 2024 • Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan
Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals.
1 code implementation • 8 May 2024 • You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan
The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry.
2 code implementations • 15 Apr 2024 • Yujia Yan, Zhiyao Duan
The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription.
Ranked #1 on Music Transcription on SMD Piano
1 code implementation • 23 Feb 2024 • Frank Cwitkowitz, Zhiyao Duan
Multi-pitch estimation is a decades-long research problem involving the detection of pitch activity associated with concurrent musical events within multi-instrument mixtures.
1 code implementation • 24 Nov 2023 • Enting Zhou, You Zhang, Zhiyao Duan
In this work, we propose to learn the AV representation from categorical emotion labels of speech.
no code implementations • 16 Sep 2023 • Yongyi Zang, Yi Zhong, Frank Cwitkowitz, Zhiyao Duan
Guitar Tablature Transcription (GTT) is an important task with broad applications in music education, composition, and entertainment.
1 code implementation • 14 Sep 2023 • Yongyi Zang, You Zhang, Mojtaba Heydari, Zhiyao Duan
These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection.
2 code implementations • 27 Jul 2023 • Yutong Wen, You Zhang, Zhiyao Duan
We further show that these normalized HRTFs can be used to learn a more unified HRTF representation across databases than the prior art.
no code implementations • 4 Jun 2023 • Mojtaba Heydari, Ju-Chiang Wang, Zhiyao Duan
Singing voice beat and downbeat tracking has several applications in automatic music production, analysis, and manipulation.
1 code implementation • 11 Mar 2023 • Ge Zhu, Yujia Yan, Juan-Pablo Caceres, Zhiyao Duan
Non-linguistic filler words, such as "uh" or "um", are prevalent in spontaneous speech and serve as indicators of hesitation or uncertainty.
Automatic Speech Recognition (ASR)
2 code implementations • 4 Nov 2022 • Siwen Ding, You Zhang, Zhiyao Duan
Our previous research on one-class learning has improved the generalization ability to unseen attacks by compacting the bona fide speech in the embedding space.
2 code implementations • 27 Oct 2022 • You Zhang, Yuxiang Wang, Zhiyao Duan
In this work, we propose to use neural fields, a differentiable representation of functions through neural networks, to model HRTFs with arbitrary spatial sampling schemes.
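As a rough illustration of the idea (not the paper's exact architecture), the sketch below defines a small PyTorch MLP that maps a source direction to an HRTF log-magnitude spectrum and can be queried at arbitrary directions; the layer sizes and input parameterization are assumptions.

```python
# Minimal sketch of a neural field for HRTF magnitudes: an MLP that maps a
# sound-source direction (azimuth, elevation) to a log-magnitude spectrum.
# Sizes and architecture are illustrative, not the paper's exact model.
import torch
import torch.nn as nn

class HRTFField(nn.Module):
    def __init__(self, n_freq_bins: int = 128, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq_bins),  # log-magnitude per frequency bin
        )

    def forward(self, direction: torch.Tensor) -> torch.Tensor:
        # direction: (batch, 2) with azimuth and elevation in radians
        return self.net(direction)

# Because the field is continuous in direction, it can be trained on any
# spatial sampling scheme and queried at arbitrary directions afterwards.
field = HRTFField()
query = torch.tensor([[0.5, 0.1]])   # one arbitrary direction
pred_log_mag = field(query)          # (1, 128) log-magnitude spectrum
```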
1 code implementation • 23 Sep 2022 • Meiying Chen, Zhiyao Duan
In this paper, we propose ControlVC, the first neural voice conversion system that achieves time-varying controls on pitch and speed.
1 code implementation • 31 Aug 2022 • Mojtaba Heydari, Zhiyao Duan
Tracking beats of singing voices without the presence of musical accompaniment can find many applications in music production, automatic song arrangement, and social media interaction.
1 code implementation • 28 Jul 2022 • Yuxiang Wang, You Zhang, Zhiyao Duan, Mark Bocko
For the HRTF data, we use truncated spherical harmonic (SH) coefficients to represent the HRTF magnitudes and onsets.
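The sketch below illustrates fitting a truncated SH expansion to HRTF magnitudes at one frequency bin by least squares; the SH order, the random placeholder measurements, and the plain least-squares fit are assumptions for illustration, not the paper's setup.

```python
# Minimal sketch of representing HRTF magnitudes (one frequency bin) with a
# truncated spherical harmonic (SH) expansion, fit by least squares.
import numpy as np
from scipy.special import sph_harm

def sh_basis(order, azimuth, colatitude):
    """Complex SH basis matrix of shape (n_directions, (order+1)**2)."""
    cols = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            cols.append(sph_harm(m, n, azimuth, colatitude))
    return np.stack(cols, axis=1)

# Measured directions (radians) and HRTF magnitudes at one frequency bin.
rng = np.random.default_rng(0)
az = rng.uniform(0, 2 * np.pi, 440)
col = rng.uniform(0, np.pi, 440)
mag = rng.uniform(0.1, 1.0, 440)                   # placeholder measurements

Y = sh_basis(order=8, azimuth=az, colatitude=col)
coeffs, *_ = np.linalg.lstsq(Y, mag, rcond=None)   # truncated SH coefficients

# The compact coefficient vector can be evaluated at any direction:
recon = (sh_basis(8, az[:5], col[:5]) @ coeffs).real
```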
no code implementations • 21 Jun 2022 • Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, ChangShui Zhang
This clarification of the definition is motivated by our extensive experiments, through which we find that existing ASD methods fail to model audio-visual synchronization and often classify unsynchronized videos as active speaking.
1 code implementation • 19 Apr 2022 • Ge Zhu, Jordan Darefsky, Fei Jiang, Anton Selitskiy, Zhiyao Duan
Fully-supervised models for source separation are trained on parallel mixture-source data and are currently state-of-the-art.
2 code implementations • 17 Apr 2022 • Frank Cwitkowitz, Jonathan Driedger, Zhiyao Duan
This naturally enforces playability constraints for guitar, and yields tablature which is more consistent with the symbolic data used to estimate pairwise likelihoods.
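For intuition only, the sketch below decodes a string/fret assignment for a short monophonic pitch sequence by maximizing placeholder pairwise scores with a Viterbi pass; the candidate generation and the scoring function are illustrative and not the paper's inference procedure.

```python
# Rough sketch: Viterbi decoding of (string, fret) positions for a monophonic
# note sequence, scored by placeholder pairwise "playability" log-likelihoods.
OPEN_STRINGS = [40, 45, 50, 55, 59, 64]   # standard-tuning MIDI pitches
N_FRETS = 19

def candidates(pitch):
    """All playable (string, fret) positions for a MIDI pitch."""
    return [(s, pitch - o) for s, o in enumerate(OPEN_STRINGS)
            if 0 <= pitch - o <= N_FRETS]

def pair_logp(prev, curr):
    """Placeholder pairwise score: prefer small fret-hand movement."""
    return -abs(prev[1] - curr[1]) - 0.5 * abs(prev[0] - curr[0])

def decode(pitches):
    scores = {c: 0.0 for c in candidates(pitches[0])}
    back = [dict()]
    for p in pitches[1:]:
        new_scores, bp = {}, {}
        for c in candidates(p):
            best_prev = max(scores, key=lambda q: scores[q] + pair_logp(q, c))
            new_scores[c] = scores[best_prev] + pair_logp(best_prev, c)
            bp[c] = best_prev
        scores = new_scores
        back.append(bp)
    path = [max(scores, key=scores.get)]      # trace back the best path
    for bp in reversed(back[1:]):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(decode([64, 67, 69, 72]))   # [(string_index, fret), ...]
```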
1 code implementation • 10 Feb 2022 • You Zhang, Ge Zhu, Zhiyao Duan
We further propose fusion strategies for direct inference and fine-tuning to predict the SASV score based on the framework.
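As a baseline illustration of score-level fusion (not the specific strategies proposed here), the snippet below combines an ASV score and a CM score into a single SASV score via sigmoid calibration and a product rule.

```python
# Minimal sketch of score-level fusion for spoofing-aware speaker verification
# (SASV): combine an ASV similarity score and a countermeasure (CM) score into
# one accept/reject score. The product-of-sigmoids rule is an illustrative
# baseline, not the fusion strategies proposed in the paper.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sasv_score(asv_score: float, cm_score: float) -> float:
    """Probability that the trial is both the target speaker and bona fide."""
    return sigmoid(asv_score) * sigmoid(cm_score)

print(sasv_score(asv_score=2.1, cm_score=-0.3))
```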
1 code implementation • NeurIPS 2021 • Yujia Yan, Frank Cwitkowitz, Zhiyao Duan
When formulating piano transcription in this way, we eliminate the need to rely on disjoint frame-level estimates for different stages of a note event.
Ranked #7 on Music Transcription on MAESTRO
1 code implementation • 1 Nov 2021 • Mojtaba Heydari, Matthew McCallum, Andreas Ehmann, Zhiyao Duan
Inferring music time structures has a broad range of applications in music production, processing and analysis.
Ranked #1 on Online Beat Tracking on GTZAN
1 code implementation • 8 Oct 2021 • Ge Zhu, Frank Cwitkowitz, Zhiyao Duan
In this paper, we conduct a cross-dataset study on parametric and non-parametric raw-waveform based speaker embeddings through speaker verification experiments.
2 code implementations • 23 Aug 2021 • Frank Cwitkowitz, Mojtaba Heydari, Zhiyao Duan
In this work, several variations of a frontend filterbank learning module are investigated for piano transcription, a challenging low-level music information retrieval task.
4 code implementations • 8 Aug 2021 • Mojtaba Heydari, Frank Cwitkowitz, Zhiyao Duan
The online estimation of rhythmic information, such as beat positions, downbeat positions, and meter, is critical for many real-time music applications.
Ranked #1 on Online Beat Tracking on Rock Corpus
2 code implementations • 26 Jul 2021 • Xinhui Chen, You Zhang, Ge Zhu, Zhiyao Duan
Different from previous ASVspoof challenges, the LA task this year introduces codec and transmission channel variability, while the new DF task introduces general audio compression.
no code implementations • 1 Jul 2021 • Bochen Li, Yuxuan Wang, Zhiyao Duan
Separating a song into vocal and accompaniment components is an active research topic, and recent years have witnessed increased performance from supervised training with deep learning techniques.
3 code implementations • 3 Apr 2021 • You Zhang, Ge Zhu, Fei Jiang, Zhiyao Duan
Spoofing countermeasure (CM) systems are critical in speaker verification; they aim to discern spoofing attacks from bona fide speech trials.
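A hedged sketch of a one-class, margin-based objective in this spirit is given below: it compacts bona fide embeddings around a learnable reference direction and pushes spoofed embeddings away. The margins and scale are illustrative and not claimed to match the paper's exact loss.

```python
# Sketch of a one-class, margin-based objective: compact bona fide embeddings
# around a single direction, push spoofed embeddings away. Margins m_bona,
# m_spoof and scale alpha are illustrative assumptions.
import torch
import torch.nn.functional as F

def one_class_margin_loss(emb, labels, w, m_bona=0.9, m_spoof=0.2, alpha=20.0):
    """
    emb:    (batch, dim) embeddings
    labels: (batch,) 1 for bona fide, 0 for spoofed
    w:      (dim,) learnable reference direction for the bona fide class
    """
    cos = F.normalize(emb, dim=1) @ F.normalize(w, dim=0)   # cosine scores
    # Bona fide: penalize cos < m_bona; spoofed: penalize cos > m_spoof.
    margin = torch.where(labels == 1, m_bona - cos, cos - m_spoof)
    return F.softplus(alpha * margin).mean()

# Usage with random tensors:
emb = torch.randn(8, 160)
labels = torch.randint(0, 2, (8,))
w = torch.randn(160, requires_grad=True)
loss = one_class_margin_loss(emb, labels, w)
loss.backward()
```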
1 code implementation • NeurIPS 2020 • Nan Jiang, Sheng Jin, Zhiyao Duan, ChangShui Zhang
An interaction reward model is trained on the duets formed from outer parts of Bach chorales to model counterpoint interaction, while a style reward model is trained on monophonic melodies of Chinese folk songs to model melodic patterns.
1 code implementation • 5 Nov 2020 • Mojtaba Heydari, Zhiyao Duan
Most existing online beat tracking (OBT) methods either apply offline approaches to a moving window of past data to predict future beat positions, or must be primed with past data at startup to initialize.
3 code implementations • 27 Oct 2020 • You Zhang, Fei Jiang, Zhiyao Duan
Human voices can be used to authenticate the identity of the speaker, but automatic speaker verification (ASV) systems are vulnerable to voice spoofing attacks, such as impersonation, replay, text-to-speech, and voice conversion.
1 code implementation • 24 Oct 2020 • Ge Zhu, Fei Jiang, Zhiyao Duan
State-of-the-art text-independent speaker verification systems typically use cepstral features or filter bank energies as speech features.
no code implementations • 14 Sep 2020 • Runze Su, Fei Tao, Xudong Liu, Hao-Ran Wei, Xiaorong Mei, Zhiyao Duan, Lei Yuan, Ji Liu, Yuying Xie
Short-term user-generated video (UGV) applications, such as Snapchat and YouTube short videos, have boomed recently, giving rise to many multimodal machine learning tasks.
1 code implementation • 8 Aug 2020 • Sefik Emre Eskimez, You Zhang, Zhiyao Duan
Visual emotion expression plays an important role in audiovisual speech communication.
no code implementations • 8 Feb 2020 • Nan Jiang, Sheng Jin, Zhiyao Duan, Chang-Shui Zhang
We cast this as a reinforcement learning problem, where the generation agent learns a policy to generate a musical note (action) based on previously generated context (state).
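The sketch below shows this framing in minimal form: a policy network maps the previously generated context (state) to a distribution over the next note (action) and is updated with REINFORCE. The action space, one-hot state, and reward function are placeholders for the paper's learned reward models.

```python
# Minimal REINFORCE sketch of the RL framing: state = previous context,
# action = next note, reward = placeholder for the learned reward models.
import torch
import torch.nn as nn

N_PITCHES = 38                     # illustrative action space
policy = nn.Sequential(nn.Linear(N_PITCHES, 128), nn.ReLU(),
                       nn.Linear(128, N_PITCHES))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward(sequence):              # placeholder for interaction/style rewards
    return torch.tensor(float(len(set(sequence))) / N_PITCHES)

state = torch.zeros(1, N_PITCHES)  # empty context
log_probs, seq = [], []
for _ in range(16):                # generate a 16-note phrase
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    log_probs.append(dist.log_prob(action))
    seq.append(action.item())
    state = torch.zeros(1, N_PITCHES)
    state[0, action] = 1.0         # simplistic state: one-hot of last note

loss = -torch.stack(log_probs).sum() * reward(seq)   # REINFORCE update
opt.zero_grad(); loss.backward(); opt.step()
```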
1 code implementation • 9 May 2019 • Lele Chen, Ross K. Maddox, Zhiyao Duan, Chenliang Xu
We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions.
1 code implementation • ECCV 2018 • Lele Chen, Zhiheng Li, Ross K. Maddox, Zhiyao Duan, Chenliang Xu
In this paper, we consider the following task: given an arbitrary speech audio clip and one lip image of an arbitrary target identity, generate synthesized lip movements of the target identity saying the speech.
no code implementations • 26 Mar 2018 • Sefik Emre Eskimez, Ross K. Maddox, Chenliang Xu, Zhiyao Duan
In this paper, we present a system that can generate landmark points of a talking face from acoustic speech in real time.
2 code implementations • ECCV 2018 • Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu
In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos.
no code implementations • 26 Apr 2017 • Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, Chenliang Xu
Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments.