1 code implementation • 29 Nov 2024 • Julian D Parker, Anton Smirnov, Jordi Pons, CJ Carr, Zack Zukowski, Zach Evans, Xubo Liu
The tokenization of speech with neural audio codec models is a vital part of modern AI pipelines for the generation or understanding of speech, alone or in a multimodal context.
no code implementations • 17 Sep 2024 • Xiaoyu Bie, Xubo Liu, Gaël Richard
Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens.
no code implementations • 16 Jul 2024 • Junqi Zhao, Xubo Liu, Jinzheng Zhao, Yi Yuan, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
In this paper, we propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system to enhance its separation performance.
1 code implementation • 27 Jun 2024 • Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, Lilian Tang
Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator.
1 code implementation • 26 Jun 2024 • Qiushi Huang, Xubo Liu, Tom Ko, Bo Wu, Wenwu Wang, Yu Zhang, Lilian Tang
To alleviate those issues, we propose Selective Prompt Tuning (SPT), which softly prompts LLMs for personalized conversations in a selective way.
no code implementations • 20 Jun 2024 • Meng Cui, Xubo Liu, Haohe Liu, Jinzheng Zhao, Daoliang Li, Wenwu Wang
This paper presents a comprehensive review of three interconnected digital aquaculture tasks, namely, fish tracking, counting, and behaviour analysis, using a novel and unified approach.
no code implementations • 21 May 2024 • Peng Gao, Yujian Lee, Hui Zhang, Xubo Liu, Yiyang Hu, Guquan Jing
Effectively minimizing these cross-modal discrepancies relies on obtaining representations that are guided by identity and consistent across modalities, while also filtering out representations that are irrelevant to identity.
1 code implementation • 28 Apr 2024 • Qixin Deng, Qikai Yang, Ruibin Yuan, Yipeng Huang, Yi Wang, Xubo Liu, Zeyue Tian, Jiahao Pan, Ge Zhang, Hanfeng Lin, Yizhi Li, Yinghao Ma, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wenwu Wang, Guangyu Xia, Wei Xue, Yike Guo
Music composition represents the creative side of humanity, and is itself a complex task that requires the ability to understand and generate information under long-range dependency and harmony constraints.
no code implementations • 27 Apr 2024 • Yi Yuan, Zhuo Chen, Xubo Liu, Haohe Liu, Xuenan Xu, Dongya Jia, Yuanzhe Chen, Mark D. Plumbley, Wenwu Wang
Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks.
1 code implementation • 14 Mar 2024 • Jinhua Liang, Huan Zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos
We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing.
2 code implementations • 30 Nov 2023 • Jinhua Liang, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos
Moreover, we improve the audio language model framework by using interleaved audio-text embeddings as the input sequence.
1 code implementation • 30 Nov 2023 • Yuzhuo Liu, Xubo Liu, Yan Zhao, Yuanyuan Wang, Rui Xia, Pingchuan Tian, Yuxuan Wang
Specifically, APT improves the separation performance of specific sources through training a small number of prompt parameters with limited audio samples, while maintaining the generalization of the USS model by keeping its parameters frozen.
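The frozen-backbone idea behind APT can be illustrated with a minimal numpy sketch: the separation model's weights W stay frozen, and only a small prompt vector p, added to the input features, is updated by gradient descent. The linear "model", shapes, and squared-error objective below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
W = rng.normal(size=(dim, dim))      # frozen "USS model" weights (sketch)
p = np.zeros(dim)                    # the only trainable parameters: the prompt
x = rng.normal(size=dim)             # one input feature frame
target = rng.normal(size=dim)        # desired separated output

def loss(p):
    return float(np.sum((W @ (x + p) - target) ** 2))

initial_loss = loss(p)
W_before = W.copy()
for _ in range(200):
    grad_p = 2 * W.T @ (W @ (x + p) - target)  # d/dp of ||W(x + p) - target||^2
    p -= 0.01 * grad_p                          # update only the prompt

final_loss = loss(p)
assert np.allclose(W, W_before)                 # the backbone is untouched
print(final_loss < initial_loss)                # prompt tuning reduced the loss
```

Because only `p` receives gradient updates, the generalization of the frozen model is preserved while the prompt adapts it to the target source.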
no code implementations • 11 Oct 2023 • Yaru Chen, Ruohao Guo, Xubo Liu, Peipei Wu, Guangyao Li, Zhenbo Li, Wenwu Wang
Audio-visual video parsing is the task of categorizing a video at the segment level with weak labels, and predicting whether each event is audible, visible, or both.
no code implementations • 14 Sep 2023 • Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance.
Ranked #3 on Audio Generation on AudioCaps
2 code implementations • 10 Aug 2023 • Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley
Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model.
Ranked #4 on Audio Generation on AudioCaps
1 code implementation • 9 Aug 2023 • Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries.
1 code implementation • 26 Jul 2023 • Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text.
1 code implementation • 17 Jun 2023 • Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark D. Plumbley, Wenwu Wang
We have observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model.
no code implementations • 16 Jun 2023 • Özkan Çaylı, Xubo Liu, Volkan Kılıç, Wenwu Wang
Automatically describing audio-visual content with texts, namely video captioning, has received significant attention due to its potential applications across diverse fields.
no code implementations • 30 May 2023 • Jianyuan Sun, Xubo Liu, Xinhao Mei, Volkan Kılıç, Mark D. Plumbley, Wenwu Wang
Experimental results show that LHDFF outperforms existing audio captioning models.
no code implementations • 28 May 2023 • Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark D. Plumbley, Wenwu Wang
We present the Treff adapter, a training-efficient adapter for CLAP, to boost zero-shot classification performance by making use of a small set of labelled data.
no code implementations • CVPR 2023 • Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen
Furthermore, when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours).
no code implementations • 7 Mar 2023 • Yi Yuan, Haohe Liu, Jinhua Liang, Xubo Liu, Mark D. Plumbley, Wenwu Wang
Deep neural networks have recently achieved breakthroughs in sound generation.
4 code implementations • 29 Jan 2023 • Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley
By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency.
Ranked #10 on Audio Generation on AudioCaps
no code implementations • 5 Dec 2022 • Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips).
1 code implementation • 22 Nov 2022 • Haohe Liu, Qiuqiang Kong, Xubo Liu, Xinhao Mei, Wenwu Wang, Mark D. Plumbley
The proposed metric, ontology-aware mean average precision (OmAP), addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
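The gist of an ontology-aware evaluation can be sketched as follows: before scoring, ground-truth labels are expanded to include their ontology ancestors, so a prediction of a semantically close parent class is not scored as a complete miss. The tiny ontology, class names, and aggregation below are illustrative assumptions, not the paper's metric code.

```python
import numpy as np

# Toy two-level ontology: each class maps to its parent (None = root).
ontology = {"dog_bark": "animal", "cat_meow": "animal", "animal": None}

def expand(labels):
    """Add all ancestors of each label under the toy ontology."""
    out = set()
    for lab in labels:
        while lab is not None:
            out.add(lab)
            lab = ontology.get(lab)
    return out

def average_precision(scores, positives):
    """Plain AP over a ranked list: scores maps class -> score,
    positives is the set of correct classes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits, precisions = 0, []
    for i, cls in enumerate(ranked, start=1):
        if cls in positives:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

scores = {"dog_bark": 0.2, "cat_meow": 0.7, "animal": 0.6}
plain = average_precision(scores, {"dog_bark"})           # ontology ignored
aware = average_precision(scores, expand({"dog_bark"}))   # ancestors count
print(round(plain, 3), round(aware, 3))  # → 0.333 0.583
```

Here the model's confident "animal" prediction is rewarded once the ontology is taken into account, raising the score relative to plain AP.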
1 code implementation • 28 Oct 2022 • Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
Audio captioning aims to generate text descriptions of audio clips.
1 code implementation • 27 Oct 2022 • Qiushi Huang, Yu Zhang, Tom Ko, Xubo Liu, Bo Wu, Wenwu Wang, Lilian Tang
Persona-based dialogue systems aim to generate consistent responses based on historical context and predefined persona.
no code implementations • 10 Oct 2022 • Jianyuan Sun, Xubo Liu, Xinhao Mei, Mark D. Plumbley, Volkan Kilic, Wenwu Wang
Moreover, in LHDFF, a new PANNs encoder, called Residual PANNs (RPANNs), is proposed, which fuses the low-dimensional feature from an intermediate convolutional layer with the high-dimensional feature from the final layer of PANNs.
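The fusion of a low-dimensional intermediate feature with a high-dimensional final feature can be sketched in a few lines: the intermediate feature map is projected to the final layer's width and combined residually. The layer widths, the random stand-in features, and the projection below are assumptions for demonstration, not the RPANNs implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10                                   # time frames
low = rng.normal(size=(T, 128))          # intermediate-layer feature (low-dim)
high = rng.normal(size=(T, 2048))        # final-layer feature (high-dim)

# Learned projection (sketched as a fixed random matrix, scaled for stability)
proj = rng.normal(size=(128, 2048)) / np.sqrt(128)

fused = high + low @ proj                # residual fusion of the two streams
print(fused.shape)                       # → (10, 2048)
```

The fused representation keeps the width expected by the downstream captioning decoder while injecting information from the earlier layer.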
1 code implementation • 4 Oct 2022 • Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
The audio spectrogram is a time-frequency representation that has been widely used for audio classification.
1 code implementation • 3 Oct 2022 • Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley, Wenwu Wang
Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios.
no code implementations • 2 Aug 2022 • Arshdeep Singh, James A King, Xubo Liu, Wenwu Wang, Mark D. Plumbley
This technical report describes the SurreyAudioTeam22's submission for DCASE 2022 ASC Task 1, Low-Complexity Acoustic Scene Classification (ASC).
1 code implementation • 15 Jul 2022 • Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
In addition, we use transductive inference on the validation set during training for better adaptation to novel classes.
1 code implementation • 15 Jul 2022 • Yang Xiao, Xubo Liu, James King, Arshdeep Singh, Eng Siong Chng, Mark D. Plumbley, Wenwu Wang
Experimental results on DCASE 2019 Task 1 and the ESC-50 dataset show that our proposed method outperforms baseline continual learning methods in classification accuracy and computational efficiency, indicating that it can efficiently and incrementally learn new classes without catastrophic forgetting for on-device environmental sound classification.
no code implementations • 12 May 2022 • Xinhao Mei, Xubo Liu, Mark D. Plumbley, Wenwu Wang
In this paper, we present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets.
1 code implementation • 12 Apr 2022 • Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang
Speech restoration aims to remove distortions in speech signals.
1 code implementation • 29 Mar 2022 • Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
We present an extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets.
1 code implementation • 28 Mar 2022 • Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang Wang
In this paper, we propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
Ranked #2 on Audio Super-Resolution on VCTK Multi-Speaker
1 code implementation • 28 Mar 2022 • Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing").
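Query-conditioned separation of this kind can be illustrated with a minimal sketch: a text-query embedding conditions a mask over the mixture spectrogram, and the mask is applied elementwise to extract the target source. The bag-of-words "text encoder" and the random conditioning weights are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = {"man": 0, "tells": 1, "joke": 2, "people": 3, "laughing": 4}

def encode_query(text):
    """Toy bag-of-words text encoder -> fixed-size query embedding."""
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    return v

F, T = 16, 20
mixture = np.abs(rng.normal(size=(F, T)))  # magnitude spectrogram of the mixture
Wq = rng.normal(size=(len(vocab), F))      # query embedding -> per-bin mask logits

def separate(mixture, query_text):
    q = encode_query(query_text)
    mask = 1.0 / (1.0 + np.exp(-(q @ Wq)))  # sigmoid mask, one value per freq bin
    return mixture * mask[:, None]          # broadcast the mask over time

est = separate(mixture, "people laughing")
print(est.shape)  # → (16, 20)
```

Since the sigmoid mask lies in (0, 1), the estimated source magnitude never exceeds the mixture, mirroring the masking-based separation setting.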
no code implementations • 7 Mar 2022 • Jianyuan Sun, Xubo Liu, Xinhao Mei, Jinzheng Zhao, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
In this paper, we propose a novel approach for ASC using deep neural decision forest (DNDF).
no code implementations • 6 Mar 2022 • Xubo Liu, Xinhao Mei, Qiushi Huang, Jianyuan Sun, Jinzheng Zhao, Haohe Liu, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks.
no code implementations • 13 Oct 2021 • Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
As different people may describe an audio clip from different aspects using distinct words and grammars, we argue that an audio captioning system should have the ability to generate diverse captions for a fixed audio clip and across similar audio clips.
1 code implementation • 5 Aug 2021 • Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang
Automated audio captioning aims to use natural language to describe the content of audio data.
2 code implementations • 21 Jul 2021 • Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D. Plumbley, Wenwu Wang
Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip.
1 code implementation • 21 Jul 2021 • Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free.
1 code implementation • 21 Jul 2021 • Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
We evaluate our approach on the UrbanSound8K dataset, compared to SampleRNN, with the performance metrics measuring the quality and diversity of generated sounds.
1 code implementation • 19 Jul 2021 • Qiushi Huang, Tom Ko, H Lilian Tang, Xubo Liu, Bo Wu
Punctuation is critical in understanding natural language text.