no code implementations • 23 Jun 2024 • Jinzheng Zhao, Xinyuan Qian, Yong Xu, Haohe Liu, Yin Cao, Davide Berghi, Wenwu Wang
Sound event localization and detection (SELD) aims to determine the occurrence of sound classes, together with their Direction of Arrival (DOA).
Room Impulse Response (RIR) · Sound Event Localization and Detection · +1
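As background for the DOA estimation mentioned above, here is a classical signal-processing baseline (not the paper's neural method): recovering the direction of arrival from the time difference between two microphones via cross-correlation. The signal, spacing, and delay below are synthetic illustrative assumptions.

```python
import numpy as np

# Illustrative classical DOA baseline, not the paper's method: estimate
# the arrival-time difference between two microphones by cross-correlation,
# then convert it to an angle using the array geometry.
sr, mic_dist, c = 16000, 0.2, 343.0       # sample rate (Hz), 20 cm spacing, speed of sound (m/s)
rng = np.random.default_rng(0)
sig = rng.normal(size=2000)               # broadband source signal
delay = 4                                 # samples by which mic 2 hears the source late
mic1 = sig
mic2 = np.concatenate([np.zeros(delay), sig[:-delay]])

xcorr = np.correlate(mic2, mic1, mode="full")
lag = xcorr.argmax() - (len(mic1) - 1)    # estimated delay in samples
tdoa = lag / sr                           # time difference of arrival (s)
angle = np.degrees(np.arcsin(np.clip(tdoa * c / mic_dist, -1.0, 1.0)))
print(lag, round(angle, 1))               # recovers the 4-sample delay, ~25.4 degrees
```

Real SELD systems learn this mapping from multichannel features instead, but the geometry being inverted is the same.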
no code implementations • 20 Jun 2024 • Meng Cui, Xubo Liu, Haohe Liu, Jinzheng Zhao, Daoliang Li, Wenwu Wang
This review provides a comprehensive analysis of the current state of aquaculture digital technologies, including vision-based, acoustic-based, and biosensor-based methods.
no code implementations • 30 Apr 2024 • Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data.
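The tokenisation step described above can be sketched with a toy vector quantiser: each continuous frame embedding is mapped to the index of its nearest codeword, yielding a discrete token sequence a language model can consume. The codebook here is random for illustration; a real neural codec learns it end-to-end (often with several residual codebooks).

```python
import numpy as np

# Toy sketch of a codec's quantiser: map continuous frame embeddings to
# discrete tokens by nearest-codeword lookup.  Codebook is random here;
# a real audio codec learns it jointly with its encoder/decoder.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 8))   # 256 codewords, 8-dim embeddings

def tokenize(frames: np.ndarray) -> np.ndarray:
    """Return the index of the nearest codeword for each frame."""
    # (T, 1, 8) - (1, 256, 8) -> (T, 256) squared distances
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)            # one integer token per frame

frames = rng.normal(size=(100, 8))     # 100 frames of stand-in audio features
tokens = tokenize(frames)
print(tokens.shape)                    # (100,): a token sequence for an LM
```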
no code implementations • 27 Apr 2024 • Yi Yuan, Zhuo Chen, Xubo Liu, Haohe Liu, Xuenan Xu, Dongya Jia, Yuanzhe Chen, Mark D. Plumbley, Wenwu Wang
Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks.
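Contrastive pretraining of this kind typically optimises a symmetric InfoNCE-style objective over a batch of matched audio/text embedding pairs. A minimal NumPy sketch, with hypothetical embeddings and an illustrative temperature:

```python
import numpy as np

# Sketch of a CLAP-style symmetric contrastive loss: matched audio/text
# pairs (the diagonal of the similarity matrix) are pulled together,
# mismatched pairs pushed apart.
def clap_loss(audio_emb, text_emb, temperature=0.07):
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature               # (B, B) cosine-similarity logits
    def xent(lg):                                # cross-entropy; the matched
        lg = lg - lg.max(axis=1, keepdims=True)  # pair sits on the diagonal
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()
    return 0.5 * (xent(logits) + xent(logits.T))  # audio->text and text->audio

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))                   # 4 fake audio/text pairs
matched = clap_loss(emb, emb)                    # aligned pairs -> low loss
shuffled = clap_loss(emb, emb[::-1].copy())      # misaligned pairs -> high loss
print(matched < shuffled)                        # True
```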
1 code implementation • 23 Apr 2024 • Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, Wei Xue
The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation.
1 code implementation • 14 Mar 2024 • Jinhua Liang, Huan Zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos
We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing.
1 code implementation • 5 Feb 2024 • Jisheng Bai, Mou Wang, Haohe Liu, Han Yin, Yafei Jia, Siwei Huang, Yutong Du, Dongzhe Zhang, Dongyuan Shi, Woon-Seng Gan, Mark D. Plumbley, Susanto Rahardja, Bin Xiang, Jianfeng Chen
In addition, considering the abundance of unlabelled acoustic scene data in the real world, it is important to study how such data can be utilised.
no code implementations • 14 Sep 2023 • Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance.
Ranked #3 on Audio Generation on AudioCaps
1 code implementation • 13 Sep 2023 • Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, Mark D. Plumbley
Audio super-resolution is a fundamental task that predicts high-frequency components for low-resolution audio, enhancing audio quality in digital applications.
2 code implementations • 10 Aug 2023 • Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley
Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model.
Ranked #4 on Audio Generation on AudioCaps
1 code implementation • 9 Aug 2023 • Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries.
1 code implementation • 3 Aug 2023 • Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov
Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation.
1 code implementation • 26 Jul 2023 • Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text.
1 code implementation • 17 Jun 2023 • Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark D. Plumbley, Wenwu Wang
We have observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model.
1 code implementation • 30 May 2023 • Arshdeep Singh, Haohe Liu, Mark D. Plumbley
Sounds carry an abundance of information about activities and events in our everyday environment, such as traffic noise, road works, music, or people talking.
no code implementations • 28 May 2023 • Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark D. Plumbley, Wenwu Wang
We present the Treff adapter, a training-efficient adapter for CLAP, to boost zero-shot classification performance by making use of a small set of labelled data.
3 code implementations • 30 Mar 2023 • Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang
To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Ranked #1 on Zero-Shot Environment Sound Classification on ESC-50 (using extra training data)
no code implementations • 7 Mar 2023 • Yi Yuan, Haohe Liu, Jinhua Liang, Xubo Liu, Mark D. Plumbley, Wenwu Wang
Deep neural networks have recently achieved breakthroughs in sound generation.
4 code implementations • 29 Jan 2023 • Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley
By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency.
Ranked #10 on Audio Generation on AudioCaps
1 code implementation • 30 Dec 2022 • Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic
Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples.
1 code implementation • 22 Nov 2022 • Haohe Liu, Qiuqiang Kong, Xubo Liu, Xinhao Mei, Wenwu Wang, Mark D. Plumbley
The proposed metric, ontology-aware mean average precision (OmAP), addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
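For context, the ontology-unaware baseline that OmAP refines is standard average precision, computed per class from a ranked score list; the exact ontology weighting is defined in the paper and not reproduced here.

```python
import numpy as np

# Standard (ontology-unaware) average precision, the baseline metric
# that OmAP improves on.  Assumes at least one positive label.
def average_precision(scores, labels):
    order = np.argsort(-scores)                 # rank by descending score
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)                    # true positives so far
    prec = hits / (np.arange(len(labels)) + 1)  # precision at each rank
    return (prec * labels).sum() / labels.sum() # mean precision at positives

ap = average_precision(np.array([0.9, 0.8, 0.3, 0.1]),
                       np.array([1, 0, 1, 0]))
print(ap)   # (1/1 + 2/3) / 2 = 0.8333...
```

mAP is then the mean of this quantity over classes; OmAP additionally weights errors by their distance in the AudioSet ontology.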
1 code implementation • 28 Oct 2022 • Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
Audio captioning aims to generate text descriptions of audio clips.
1 code implementation • 4 Oct 2022 • Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
The audio spectrogram is a time-frequency representation that has been widely used for audio classification.
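A minimal version of that time-frequency representation: frame the waveform, apply a window, and take the FFT magnitude of each frame. The frame size, hop, and test tone below are arbitrary illustrative choices.

```python
import numpy as np

# Minimal magnitude spectrogram: windowed frames -> FFT magnitudes.
def spectrogram(x, n_fft=512, hop=128):
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (frames, n_fft // 2 + 1)

sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))  # 1 s of a 440 Hz tone
print(spec.shape)   # (122, 257): energy concentrates near bin 440/(sr/n_fft) ~ 14
```

Classifiers usually operate on a log-mel compression of this matrix rather than the raw bins.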
1 code implementation • 3 Oct 2022 • Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley, Wenwu Wang
Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios.
1 code implementation • 15 Jul 2022 • Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
In addition, we use transductive inference on the validation set during training for better adaptation to novel classes.
1 code implementation • 30 May 2022 • Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo Mandic, Lei He, Xiang-Yang Li, Tao Qin, Sheng Zhao, Tie-Yan Liu
Combining this novel perspective of two-stage synthesis with advanced generative models (i.e., the diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples.
3 code implementations • 9 May 2022 • Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, YuanHao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, Tie-Yan Liu
In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Ranked #1 on Text-To-Speech Synthesis on LJSpeech (using extra training data)
1 code implementation • 12 Apr 2022 • Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang
Speech restoration aims to remove distortions in speech signals.
1 code implementation • 28 Mar 2022 • Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing").
1 code implementation • 28 Mar 2022 • Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang Wang
In this paper, we propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
Ranked #2 on Audio Super-Resolution on VCTK Multi-Speaker
no code implementations • 6 Mar 2022 • Xubo Liu, Xinhao Mei, Qiushi Huang, Jianyuan Sun, Jinzheng Zhao, Haohe Liu, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks.
1 code implementation • 9 Dec 2021 • Haohe Liu, Qiuqiang Kong, Jiafeng Liu
We propose a 276-layer CWS-PResUNet, which achieves state-of-the-art (SoTA) performance on vocals on the MUSDB18HQ test set with an 8.92 signal-to-distortion ratio (SDR) score.
Ranked #11 on Music Source Separation on MUSDB18
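The SDR figure quoted above measures, in decibels, how much reference-signal energy there is relative to the residual error. A simplified, scale-sensitive version (not the full BSSEval/museval metric used for MUSDB18 leaderboards), on synthetic stand-in signals:

```python
import numpy as np

# Simplified signal-to-distortion ratio: reference energy over error
# energy, in dB.  A sketch only; MUSDB18 evaluation uses the fuller
# BSSEval decomposition of the error.
def sdr(reference, estimate):
    err = reference - estimate
    return 10 * np.log10((reference ** 2).sum() / (err ** 2).sum())

rng = np.random.default_rng(0)
ref = rng.normal(size=16000)               # stand-in "clean vocals"
est = ref + 0.1 * rng.normal(size=16000)   # estimate with ~1% error energy
print(round(sdr(ref, est), 1))             # ~20 dB, since 10*log10(1/0.01) = 20
```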
1 code implementation • 12 Aug 2020 • Haohe Liu, Lei Xie, Jian Wu, Geng Yang
We aim to address the major issues in CNN-based high-resolution music source separation (MSS) models: high computational cost and weight sharing between distinctly different bands.
Audio and Speech Processing · Sound