1 code implementation • 6 Feb 2025 • Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue
Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute.
no code implementations • 6 Jan 2025 • Dichucheng Li, Yongyi Zang, Qiuqiang Kong
Automatic Music Transcription (AMT), which aims to extract musical notes from raw audio, typically uses frame-level systems with piano-roll outputs or language model (LM)-based systems with note-level predictions.
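For concreteness, here is a minimal sketch of the two output conventions named above; the note values and the 100 frames-per-second rate are illustrative assumptions, not figures from the paper.

```python
import numpy as np

# Note-level prediction: a list of (onset_sec, offset_sec, midi_pitch) events,
# the format an LM-based system would emit token by token.
notes = [(0.00, 0.48, 60), (0.50, 0.98, 64), (1.00, 1.90, 67)]

# Frame-level prediction: a binary piano roll of shape (num_frames, 128),
# the format a frame-level system would emit, here at 100 frames per second.
fps = 100
num_frames = int(2.0 * fps)
piano_roll = np.zeros((num_frames, 128), dtype=np.uint8)
for onset, offset, pitch in notes:
    piano_roll[int(onset * fps):int(offset * fps), pitch] = 1
```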
1 code implementation • 12 Oct 2024 • Xiquan Li, Wenxi Chen, Ziyang Ma, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Qiuqiang Kong, Xie Chen
By tailoring the text embedding support and the caption datastore to the target domain, DRCap acquires a robust ability to adapt to new domains in a training-free manner.
no code implementations • 15 Sep 2024 • Yudong Yang, Zhan Liu, Wenyi Yu, Guangzhi Sun, Qiuqiang Kong, Chao Zhang
Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement due to their ability to model complex speech data distributions.
no code implementations • 14 Sep 2024 • Hao Ma, Zhiyuan Peng, Xu Li, Yukai Li, Mingjie Shao, Qiuqiang Kong, Ju Liu
In a vanilla parallel-data-free training stage, the target audio is encoded by the pre-trained CLAP audio encoder to form a condition embedding, while at test time, user language queries are encoded by the CLAP text encoder as the condition embedding.
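A minimal sketch of this conditioning swap, assuming hypothetical stand-ins `clap_audio_embed` and `clap_text_embed` for the pretrained CLAP encoders (both mapping into a shared 512-d joint space); this is not the paper's released code.

```python
import torch

# Hypothetical stand-ins for the pretrained CLAP encoders; real code would
# load them from a CLAP checkpoint. Both map into the same joint space.
def clap_audio_embed(audio: torch.Tensor) -> torch.Tensor:
    return torch.randn(audio.shape[0], 512)  # placeholder embedding

def clap_text_embed(texts: list) -> torch.Tensor:
    return torch.randn(len(texts), 512)      # placeholder embedding

def get_condition(training: bool, target_audio=None, query: str = "") -> torch.Tensor:
    if training:
        # Training: condition on the CLAP audio embedding of the target audio,
        # so no paired (audio, text) data is required.
        return clap_audio_embed(target_audio)
    # Testing: the user's language query is embedded into the same space.
    return clap_text_embed([query])
```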
1 code implementation • 30 Aug 2024 • Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue
By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation.
1 code implementation • 26 Aug 2024 • Yinghao Ma, Anders Øland, Anton Ragni, Bleiz MacSen Del Sette, Charalampos Saitis, Chris Donahue, Chenghua Lin, Christos Plachouras, Emmanouil Benetos, Elona Shatri, Fabio Morreale, Ge Zhang, György Fazekas, Gus Xia, huan zhang, Ilaria Manco, Jiawen Huang, Julien Guinot, Liwei Lin, Luca Marinelli, Max W. Y. Lam, Megha Sharma, Qiuqiang Kong, Roger B. Dannenberg, Ruibin Yuan, Shangda Wu, Shih-Lun Wu, Shuqi Dai, Shun Lei, Shiyin Kang, Simon Dixon, Wenhu Chen, Wenhao Huang, Xingjian Du, Xingwei Qu, Xu Tan, Yizhi Li, Zeyue Tian, Zhiyong Wu, Zhizheng Wu, Ziyang Ma, Ziyu Wang
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music.
no code implementations • 16 Jul 2024 • Junqi Zhao, Xubo Liu, Jinzheng Zhao, Yi Yuan, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
In this paper, we propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system to enhance its separation performance.
no code implementations • 4 Jun 2024 • Renmingyue Du, Jixun Yao, Qiuqiang Kong, Yin Cao
To enhance the distinctiveness of the features reconstructed by each decoder, we incorporate contrastive learning and an auxiliary classifier to further constrain the reconstructed features.
1 code implementation • 14 Mar 2024 • Jinhua Liang, huan zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos
We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing.
1 code implementation • 27 Dec 2023 • Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang
In addition, we introduce environment representations to characterize different acoustic settings, enhancing the adaptability of our attenuation approach to various environments.
no code implementations • 16 Oct 2023 • Xingjian Du, Zhesong Yu, Jiaju Lin, Bilei Zhu, Qiuqiang Kong
However, previous music tagging research primarily focuses on closed-set music tagging tasks, which cannot be generalized to new tags.
1 code implementation • 15 Oct 2023 • Dichucheng Li, Yinghao Ma, Weixing Wei, Qiuqiang Kong, Yulun Wu, Mingjin Che, Fan Xia, Emmanouil Benetos, Wei Li
Recognizing the significance of pitch in capturing the nuances of IPTs and the importance of onset in locating IPT events, we investigate multi-task finetuning with pitch and onset detection as auxiliary tasks.
2 code implementations • 10 Aug 2023 • Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley
Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model.
Ranked #4 on Audio Generation on AudioCaps (FAD metric)
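As a sketch of the "language of audio" (LOA) idea described in the entry above: any clip is mapped to a sequence of continuous features from the frozen AudioMAE encoder, and the generator is conditioned on that sequence. `audiomae` here is a hypothetical stand-in for the pretrained encoder, not the released model.

```python
import torch

def audio_to_loa(mel_patches: torch.Tensor, audiomae: torch.nn.Module) -> torch.Tensor:
    """Map log-mel patches (batch, 1, time, freq) to LOA tokens (batch, tokens, dim)."""
    with torch.no_grad():  # the self-supervised encoder stays frozen
        return audiomae(mel_patches)

# Speech, music, and sound generation then share one interface: a generative
# model is conditioned on the LOA sequence rather than directly on raw text.
```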
1 code implementation • 9 Aug 2023 • Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries.
1 code implementation • 26 Jul 2023 • Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text.
no code implementations • 18 May 2023 • Zelin Ying, Chen Li, Yu Dong, Qiuqiang Kong, Qiao Tian, YuanYuan Huo, Yuxuan Wang
The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting linguistic features that are essential for a text-to-speech model to synthesize speech, such as phonemes and prosody.
no code implementations • 12 May 2023 • Zhichao Wang, Liumeng Xue, Qiuqiang Kong, Lei Xie, Yuanzhe Chen, Qiao Tian, Yuping Wang
Specifically, to flexibly adapt to speaker characteristics that vary along the temporal and channel axes of speech, we propose a novel fine-grained speaker modeling method, called temporal-channel retrieval (TCR), to find out when and where speaker information appears in speech.
3 code implementations • 30 Mar 2023 • Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang
To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Ranked #1 on Zero-Shot Environment Sound Classification on ESC-50 (using extra training data)
no code implementations • 1 Feb 2023 • Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Ju-Chiang Wang, Yun-Ning Hung, Dorien Herremans
Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results.
1 code implementation • 22 Nov 2022 • Haohe Liu, Qiuqiang Kong, Xubo Liu, Xinhao Mei, Wenwu Wang, Mark D. Plumbley
The proposed metric, ontology-aware mean average precision (OmAP), addresses the weaknesses of mAP by utilizing the AudioSet ontology information during evaluation.
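A minimal sketch of one plausible ontology-aware evaluation, assuming a `parent_of` mapping from each fine-grained AudioSet class to a coarser ontology node: labels and predictions are propagated to the parent level (max over children) before average precision is computed there. This illustrates the idea of exploiting ontology structure, not the paper's exact OmAP definition.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def coarse_map(y_true, y_score, parent_of):
    """y_true, y_score: (n_samples, n_classes); parent_of: class idx -> parent id."""
    parents = sorted(set(parent_of.values()))
    t = np.zeros((y_true.shape[0], len(parents)))
    s = np.zeros_like(t)
    for c, p in parent_of.items():
        j = parents.index(p)
        t[:, j] = np.maximum(t[:, j], y_true[:, c])   # a parent is present if
        s[:, j] = np.maximum(s[:, j], y_score[:, c])  # any of its children is
    return average_precision_score(t, s, average="macro")
```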
no code implementations • 4 Nov 2022 • Yin Zhu, Qiuqiang Kong, Junjie Shi, Shilei Liu, Xuzhou Ye, Ju-Chiang Wang, Junping Zhang
Binaural rendering of ambisonic signals is of broad interest to virtual reality and immersive media.
1 code implementation • 28 Oct 2022 • Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
Audio captioning aims to generate text descriptions of audio clips.
no code implementations • 27 Oct 2022 • Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang, Qiao Tian, Yuping Wang, Yuxuan Wang
In this paper, we propose to use intermediate bottleneck features (IBFs) to replace PPGs.
Automatic Speech Recognition (ASR) +2
1 code implementation • 4 Oct 2022 • Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
The audio spectrogram is a time-frequency representation that has been widely used for audio classification.
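The representation in question can be computed in a few lines; the 32 kHz sample rate, 64 mel bins, and 10 ms hop below are common choices for audio classification, assumed here for illustration.

```python
import torch
import torchaudio

waveform = torch.randn(1, 32000)  # 1 s of audio at 32 kHz (dummy signal)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=32000, n_fft=1024, hop_length=320, n_mels=64
)(waveform)
log_mel = torch.log(mel + 1e-10)  # log compression, shape (1, 64, frames)
```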
1 code implementation • 3 Oct 2022 • Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley, Wenwu Wang
Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios.
2 code implementations • 5 Sep 2022 • Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang
Our system submitted to the DCASE 2022 Task 3 is based on our previous proposed Event-Independent Network V2 (EINV2) with a novel data augmentation method.
1 code implementation • 15 Jul 2022 • Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
In addition, we use transductive inference on the validation set during training for better adaptation to novel classes.
no code implementations • 22 Jun 2022 • Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Amy Hung, Ju-Chiang Wang, Dorien Herremans
However, its novelty necessitates a new perspective on how to evaluate such a model.
Ranked #4 on Music Transcription on Slakh2100
1 code implementation • 12 Apr 2022 • Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang
Speech restoration aims to remove distortions in speech signals.
1 code implementation • 28 Mar 2022 • Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang Wang
In this paper, we propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
Ranked #2 on Audio Super-Resolution on VCTK Multi-Speaker
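A minimal sketch of the vocoder-based pipeline the entry above describes: predict the high-resolution mel spectrogram from the low-resolution input, then let a pretrained neural vocoder synthesize the waveform. `mel_fn`, `mel_predictor`, and `vocoder` are hypothetical placeholders, not the released NVSR modules.

```python
import torch

def super_resolve(lr_wave: torch.Tensor,
                  mel_fn, mel_predictor: torch.nn.Module,
                  vocoder: torch.nn.Module) -> torch.Tensor:
    lr_mel = mel_fn(lr_wave)        # mel spectrogram of the low-resolution input
    hr_mel = mel_predictor(lr_mel)  # fill in the missing high-frequency bands
    return vocoder(hr_mel)          # waveform at the target sample rate
```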
1 code implementation • 28 Mar 2022 • Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing").
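A minimal sketch of the task setup, assuming hypothetical `text_encoder` and `mask_net` modules: the query embedding conditions a masking network, and the predicted mask is applied to the mixture spectrogram to recover the target.

```python
import torch

def lass_separate(mixture_spec: torch.Tensor, query: str,
                  text_encoder, mask_net) -> torch.Tensor:
    q = text_encoder(query)                          # (1, dim) query embedding
    mask = torch.sigmoid(mask_net(mixture_spec, q))  # (1, freq, time) in [0, 1]
    return mask * mixture_spec                       # spectrogram of the target
```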
1 code implementation • 9 Dec 2021 • Haohe Liu, Qiuqiang Kong, Jiafeng Liu
On the MUSDB18HQ test set, we propose a 276-layer CWS-PResUNet and achieve state-of-the-art (SoTA) performance on vocals with an 8.92 signal-to-distortion ratio (SDR) score.
Ranked #11 on Music Source Separation on MUSDB18
1 code implementation • 7 Aug 2021 • Liwei Lin, Qiuqiang Kong, Junyan Jiang, Gus Xia
We propose a unified model for three inter-related tasks: 1) to separate individual sound sources from a mixed music audio, 2) to transcribe each sound source to MIDI notes, and 3) to synthesize new pieces based on the timbre of separated sources.
1 code implementation • 30 Mar 2021 • Feiyang Xiao, Jian Guan, Qiuqiang Kong, Wenwu Wang
Speech enhancement aims to obtain speech signals with high intelligibility and quality from noisy speech.
no code implementations • 28 Oct 2020 • Qiuqiang Kong, Keunwoo Choi, Yuxuan Wang
Music classification is a task to classify a music piece into labels such as genres or composers.
3 code implementations • 25 Oct 2020 • Yin Cao, Turab Iqbal, Qiuqiang Kong, Fengyan An, Wenwu Wang, Mark D. Plumbley
Polyphonic sound event localization and detection (SELD) jointly performs sound event detection (SED) and direction-of-arrival (DoA) estimation, detecting the types and occurrence times of sound events together with their corresponding DoA angles.
Sound • Audio and Speech Processing
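A minimal sketch of a joint SELD output head of the kind the entry above describes: a shared encoder feature feeds one branch for per-class event activity (SED) and one for per-class Cartesian DoA vectors. The layer sizes and the 14-class setup are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SELDHead(nn.Module):
    def __init__(self, feat_dim=512, num_classes=14):
        super().__init__()
        self.sed = nn.Linear(feat_dim, num_classes)      # event activity branch
        self.doa = nn.Linear(feat_dim, 3 * num_classes)  # (x, y, z) per class

    def forward(self, h):                      # h: (batch, frames, feat_dim)
        activity = torch.sigmoid(self.sed(h))  # (batch, frames, C)
        directions = torch.tanh(self.doa(h))   # (batch, frames, 3C)
        return activity, directions
```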
3 code implementations • 11 Oct 2020 • Qiuqiang Kong, Bochen Li, Jitong Chen, Yuxuan Wang
In this article, we create a GiantMIDI-Piano (GP) dataset containing 38,700,838 transcribed notes and 10,855 unique solo piano works composed by 2,786 composers.
3 code implementations • 5 Oct 2020 • Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang
In addition, previous AMT systems are sensitive to misaligned onset and offset labels in audio recordings.
Ranked #5 on Music Transcription on MAESTRO
Music Transcription • Sound • Audio and Speech Processing
2 code implementations • 30 Sep 2020 • Yin Cao, Turab Iqbal, Qiuqiang Kong, Yue Zhong, Wenwu Wang, Mark D. Plumbley
In this paper, a novel event-independent network for polyphonic sound event localization and detection is proposed.
Audio and Speech Processing • Sound
no code implementations • 25 Jul 2020 • Jingqiao Zhao, Zhen-Hua Feng, Qiuqiang Kong, Xiaoning Song, Xiao-Jun Wu
This paper presents a Depthwise Disout Convolutional Neural Network (DD-CNN) for the detection and classification of urban acoustic scenes.
no code implementations • 16 Jul 2020 • Boqing Zhu, Kele Xu, Qiuqiang Kong, Huaimin Wang, Yuxing Peng
Yet, it is labor-intensive to accurately annotate large amounts of audio data, and datasets may contain noisy labels in practical settings.
1 code implementation • 11 Feb 2020 • Turab Iqbal, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
The proposed method uses an auxiliary classifier, trained on data that is known to be in-distribution, for detection and relabelling.
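A minimal sketch of the detect-and-relabel idea: an auxiliary classifier trusted on in-distribution data scores each noisy example, confident predictions overwrite the noisy label, and unconfident ones are flagged. The 0.9 threshold is an illustrative choice, not from the paper.

```python
import numpy as np

def relabel(probs: np.ndarray, noisy_labels: np.ndarray, thresh: float = 0.9):
    """probs: (n, n_classes) auxiliary-classifier outputs; noisy_labels: (n,)."""
    conf = probs.max(axis=1)
    new_labels = np.where(conf >= thresh, probs.argmax(axis=1), noisy_labels)
    reliable = conf >= thresh  # mask of examples deemed in-distribution
    return new_labels, reliable
```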
no code implementations • 3 Jul 2019 • Jie Jiang, Qiuqiang Kong, Mark Plumbley, Nigel Gilbert
On the basis of energy disaggregation, we then investigate the performance of two deep-learning-based frameworks for the task of on/off detection, which aims to estimate whether an appliance is in operation.
1 code implementation • 14 Jun 2019 • Qiuqiang Kong, Yong Xu, Wenwu Wang, Philip J. B. Jackson, Mark D. Plumbley
Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture and is a challenging problem in which no prior knowledge of the mixing filters is available.
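One standard formulation of this setting, written out for concreteness (K sources s_k, unknown mixing filters h_k, and a learned model f_theta; the notation is ours, not necessarily the paper's):

```latex
x(t) = \sum_{k=1}^{K} (h_k * s_k)(t), \qquad
\hat{s}_1, \dots, \hat{s}_K = f_\theta(x)
```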
1 code implementation • 1 May 2019 • Yin Cao, Qiuqiang Kong, Turab Iqbal, Fengyan An, Wenwu Wang, Mark D. Plumbley
In this paper, it is experimentally shown that the training information of SED can contribute to direction-of-arrival estimation (DOAE).
Sound • Audio and Speech Processing
2 code implementations • 30 Oct 2018 • Kele Xu, Boqing Zhu, Qiuqiang Kong, Haibo Mi, Bo Ding, Dezhi Wang, Huaimin Wang
Audio tagging is challenging due to the limited size of data and noisy labels.
1 code implementation • 6 Aug 2018 • Yuanbo Hou, Qiuqiang Kong, Shengchen Li
To use the order information of sound events, we propose sequential labelled data (SLD), where both the presence or absence and the order information of sound events are known.
2 code implementations • 12 Apr 2018 • Qiuqiang Kong, Yong Xu, Iwona Sobieraj, Wenwu Wang, Mark D. Plumbley
Sound event detection (SED) aims to detect when and recognize what sound events happen in an audio clip.
Sound • Audio and Speech Processing
5 code implementations • 6 Mar 2018 • Changsong Yu, Karim Said Barsim, Qiuqiang Kong, Bin Yang
The objective of audio classification is to predict the presence or absence of audio events in an audio clip.
2 code implementations • 8 Nov 2017 • Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley
First, we propose a separation mapping from the time-frequency (T-F) representation of an audio clip to the T-F segmentation masks of the audio events.
Sound • Audio and Speech Processing
5 code implementations • 2 Nov 2017 • Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley
Then the classification of a bag is the expectation of the classification output of the instances in the bag with respect to the learned probability measure.
Sound • Audio and Speech Processing
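The bag-level expectation described in the entry above can be made concrete with attention-style pooling; a minimal sketch, where the learned probability measure is a softmax over instances (layer sizes are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class MILPool(nn.Module):
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)  # instance classifier
        self.att = nn.Linear(feat_dim, num_classes)  # learned probability measure

    def forward(self, h):                      # h: (batch, instances, feat_dim)
        p = torch.sigmoid(self.cls(h))         # per-instance class probabilities
        w = torch.softmax(self.att(h), dim=1)  # sums to 1 over instances
        return (w * p).sum(dim=1)              # expectation over the bag
```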
3 code implementations • 1 Oct 2017 • Yong Xu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won the 1st place in the large-scale weakly supervised sound event detection task of Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge.
Sound • Audio and Speech Processing
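A minimal sketch of the gated convolution such networks use: one convolution produces features, a parallel convolution produces a sigmoid gate, and the two are multiplied elementwise (a GLU-style block). Channel sizes are left as parameters; this is a sketch of the technique, not the exact DCASE 2017 submission.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.feat = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # The sigmoid gate controls, per time-frequency bin, how much of the
        # feature path passes through.
        return self.feat(x) * torch.sigmoid(self.gate(x))
```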
1 code implementation • 17 Mar 2017 • Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang, Mark D. Plumbley
Audio tagging aims to perform multi-label classification on audio chunks; it is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge.
Sound
2 code implementations • 24 Feb 2017 • Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang, Mark D. Plumbley
In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging.
1 code implementation • 6 Oct 2016 • Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark Plumbley
The labeling of an audio clip is often based on the audio events in the clip, and no event-level label is provided to the user.
Sound