no code implementations • 30 May 2023 • Jianyuan Sun, Xubo Liu, Xinhao Mei, Volkan Kılıç, Mark D. Plumbley, Wenwu Wang
Experimental results show that LHDFF outperforms existing audio captioning models.
1 code implementation • 30 May 2023 • Arshdeep Singh, Haohe Liu, Mark D. Plumbley
Sounds carry an abundance of information about activities and events in our everyday environment, such as traffic noise, road works, music, or people talking.
no code implementations • 28 May 2023 • Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark D. Plumbley, Wenwu Wang
We presented the Treff adapter, a training-efficient adapter for CLAP, to boost zero-shot classification performance by making use of a small set of labelled data.
no code implementations • 5 May 2023 • James A King, Arshdeep Singh, Mark D. Plumbley
For large-scale CNNs such as PANNs designed for audio tagging, our method reduces computation per inference by 24% with 41% fewer parameters, at a slight improvement in performance.
no code implementations • 5 Apr 2023 • Arshdeep Singh, Mark D. Plumbley
In comparison to existing active filter pruning methods, the proposed pruning method is at least 4.5 times faster in computing filter importance and achieves similar performance.
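A minimal sketch of the passive idea, assuming a norm-based importance score (one common data-free proxy; the paper's actual criterion may differ): rank each convolutional filter by the magnitude of its weights, with no forward passes over data required.

```python
import torch
import torch.nn as nn

def filter_importance(conv: nn.Conv2d) -> torch.Tensor:
    # Score each output filter by the L1 norm of its weights; no data needed.
    # conv.weight: (out_channels, in_channels, kH, kW)
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def filters_to_keep(conv: nn.Conv2d, keep_ratio: float = 0.5) -> torch.Tensor:
    scores = filter_importance(conv)
    k = max(1, int(keep_ratio * scores.numel()))
    return torch.topk(scores, k).indices  # indices of the filters to retain

conv = nn.Conv2d(64, 128, kernel_size=3)
print(filters_to_keep(conv).shape)  # 64 filters retained out of 128
```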
1 code implementation • 30 Mar 2023 • Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang
To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
no code implementations • 7 Mar 2023 • Yi Yuan, Haohe Liu, Jinhua Liang, Xubo Liu, Mark D. Plumbley, Wenwu Wang
Deep neural networks have recently achieved breakthroughs in sound generation with text prompts.
3 code implementations • 29 Jan 2023 • Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley
By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency.
Ranked #5 on Audio Generation on AudioCaps
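As a rough illustration of the continuous-latent diffusion training that AudioLDM builds on, the sketch below runs one noise-prediction step on a latent audio representation; the latent encoder, conditioning embedding, toy model, and noise schedule here are all placeholders, not the AudioLDM architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def diffusion_step(model, z, cond, alphas_cumprod):
    # z: (batch, latent_dim) latent audio; cond: (batch, cond_dim) conditioning
    t = torch.randint(0, alphas_cumprod.numel(), (z.size(0),))
    a = alphas_cumprod[t].unsqueeze(1)
    noise = torch.randn_like(z)
    z_t = a.sqrt() * z + (1 - a).sqrt() * noise      # forward noising at step t
    return F.mse_loss(model(z_t, t, cond), noise)    # train to predict the noise

class ToyEps(nn.Module):                             # stand-in noise predictor
    def __init__(self, dim=64, cond=32):
        super().__init__()
        self.net = nn.Linear(dim + cond + 1, dim)
    def forward(self, z_t, t, cond):
        return self.net(torch.cat([z_t, cond, t.float().unsqueeze(1)], dim=1))

alphas = torch.linspace(0.9999, 0.98, 1000).cumprod(dim=0)  # toy schedule
loss = diffusion_step(ToyEps(), torch.randn(8, 64), torch.randn(8, 32), alphas)
```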
no code implementations • 5 Dec 2022 • Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips).
1 code implementation • 22 Nov 2022 • Haohe Liu, Qiuqiang Kong, Xubo Liu, Xinhao Mei, Wenwu Wang, Mark D. Plumbley
The proposed metric, ontology-aware mean average precision (OmAP), addresses the weaknesses of mAP by utilizing the AudioSet ontology information during evaluation.
1 code implementation • 28 Oct 2022 • Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
Audio captioning aims to generate text descriptions of audio clips.
no code implementations • 27 Oct 2022 • Arshdeep Singh, Mark D. Plumbley
However, the computational complexity of computing the pairwise similarity matrix is high, particularly when a convolutional layer has many filters.
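To make the cost concrete, here is a hedged sketch of the pairwise similarity computation in question: cosine similarity between every pair of flattened filters yields an n × n matrix, so the work grows quadratically with the number of filters.

```python
import torch
import torch.nn.functional as F

def pairwise_filter_similarity(weight: torch.Tensor) -> torch.Tensor:
    # weight: (n_filters, in_channels, kH, kW); flatten each filter to a row
    flat = weight.flatten(start_dim=1)    # (n, d)
    flat = F.normalize(flat, dim=1)       # unit-norm rows
    return flat @ flat.t()                # (n, n) cosine-similarity matrix

w = torch.randn(512, 256, 3, 3)           # a large layer with 512 filters
sim = pairwise_filter_similarity(w)
print(sim.shape)  # torch.Size([512, 512]) -- n^2 entries grow quickly with n
```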
no code implementations • 10 Oct 2022 • Jianyuan Sun, Xubo Liu, Xinhao Mei, Mark D. Plumbley, Volkan Kilic, Wenwu Wang
Moreover, in LHDFF, a new PANNs encoder called Residual PANNs (RPANNs) is proposed, which fuses the low-dimensional feature from an intermediate convolutional layer with the high-dimensional feature from the final layer of PANNs.
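A hedged sketch of this kind of residual fusion: project the low-dimensional intermediate feature up to the dimensionality of the final-layer feature and add the two. The layer sizes here are illustrative placeholders, not the exact RPANNs design.

```python
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    def __init__(self, low_dim: int = 512, high_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(low_dim, high_dim)   # match dimensionalities

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor):
        # low_feat: (batch, low_dim) from an intermediate conv block;
        # high_feat: (batch, high_dim) from the final PANNs layer.
        return high_feat + self.proj(low_feat)     # residual-style fusion

fused = ResidualFusion()(torch.randn(4, 512), torch.randn(4, 2048))
print(fused.shape)  # torch.Size([4, 2048])
```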
1 code implementation • 4 Oct 2022 • Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
Starting from a high-temporal-resolution spectrogram, such as one with a one-millisecond hop size, we show that DiffRes can improve classification accuracy at the same computational complexity.
1 code implementation • 3 Oct 2022 • Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley, Wenwu Wang
Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios.
1 code implementation • 5 Sep 2022 • Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang
Our system submitted to the DCASE 2022 Task 3 is based on our previous proposed Event-Independent Network V2 (EINV2) with a novel data augmentation method.
no code implementations • 2 Aug 2022 • Arshdeep Singh, James A King, Xubo Liu, Wenwu Wang, Mark D. Plumbley
This technical report describes the SurreyAudioTeam22's submission for DCASE 2022 ASC Task 1, Low-Complexity Acoustic Scene Classification (ASC).
no code implementations • 23 Jul 2022 • Arshdeep Singh, Mark D. Plumbley
However, CNNs are resource hungry due to their large size and high computational complexity.
1 code implementation • 15 Jul 2022 • Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
In addition, we use transductive inference on the validation set during training for better adaptation to novel classes.
1 code implementation • 15 Jul 2022 • Yang Xiao, Xubo Liu, James King, Arshdeep Singh, Eng Siong Chng, Mark D. Plumbley, Wenwu Wang
Experimental results on the DCASE 2019 Task 1 and ESC-50 datasets show that our proposed method outperforms baseline continual learning methods in classification accuracy and computational efficiency, indicating that it can efficiently and incrementally learn new classes without catastrophic forgetting for on-device environmental sound classification.
no code implementations • 12 May 2022 • Xinhao Mei, Xubo Liu, Mark D. Plumbley, Wenwu Wang
In this paper, we present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets.
1 code implementation • 29 Mar 2022 • Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
We present an extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets.
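For reference, a minimal sketch of one popular metric-learning objective in this setting, the symmetric contrastive (InfoNCE) loss over a batch of paired audio and caption embeddings; the embedding networks are assumed to exist elsewhere.

```python
import torch
import torch.nn.functional as F

def info_nce(audio_emb, text_emb, temperature: float = 0.07):
    a = F.normalize(audio_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = a @ t.t() / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))      # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```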
1 code implementation • 29 Mar 2022 • Arshdeep Singh, Mark D. Plumbley
We propose a passive filter pruning framework, where a few convolutional filters from the CNNs are eliminated to yield compressed CNNs.
1 code implementation • 28 Mar 2022 • Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing").
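A hedged sketch of the text-queried separation idea: modulate a mask predictor with the query's text embedding (here FiLM-style scaling and shifting, an illustrative choice rather than the LASS architecture) and apply the predicted mask to the mixture spectrogram.

```python
import torch
import torch.nn as nn

class QueriedMasker(nn.Module):
    def __init__(self, n_freq: int = 513, text_dim: int = 512):
        super().__init__()
        self.film = nn.Linear(text_dim, 2 * n_freq)   # per-bin scale and shift
        self.mask_head = nn.Sequential(nn.Linear(n_freq, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, text_emb):
        # mix_spec: (batch, time, n_freq); text_emb: (batch, text_dim)
        scale, shift = self.film(text_emb).chunk(2, dim=-1)
        h = mix_spec * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        mask = self.mask_head(h)                      # mask values in (0, 1)
        return mix_spec * mask                        # estimated target source

est = QueriedMasker()(torch.randn(2, 100, 513), torch.randn(2, 512))
```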
no code implementations • 7 Mar 2022 • Jianyuan Sun, Xubo Liu, Xinhao Mei, Jinzheng Zhao, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
In this paper, we propose a novel approach for ASC using deep neural decision forest (DNDF).
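A minimal sketch of the soft decision tree at the heart of a DNDF: sigmoid routers give a go-right probability at each internal node, leaves hold class distributions, and the prediction averages leaf distributions by path probability. Depth and input size are illustrative; in the full model, the input would come from a deep feature extractor.

```python
import torch
import torch.nn as nn

class SoftDecisionTree(nn.Module):
    def __init__(self, in_dim: int, n_classes: int, depth: int = 4):
        super().__init__()
        self.depth = depth
        self.routers = nn.Linear(in_dim, 2 ** depth - 1)    # one unit per internal node
        self.leaves = nn.Parameter(torch.randn(2 ** depth, n_classes))

    def forward(self, x):
        d = torch.sigmoid(self.routers(x))    # go-right probability per node
        mu = x.new_ones(x.size(0), 1)         # path probability, root = 1
        begin = 0
        for level in range(self.depth):
            n = 2 ** level                    # nodes at this level
            dec = d[:, begin:begin + n]
            # each path splits into (go-left, go-right) children
            mu = torch.stack([mu * (1 - dec), mu * dec], dim=2).flatten(1)
            begin += n
        # average leaf class distributions by path probability
        return mu @ torch.softmax(self.leaves, dim=1)   # (batch, n_classes)

probs = SoftDecisionTree(in_dim=128, n_classes=10)(torch.randn(4, 128))
```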
no code implementations • 6 Mar 2022 • Xubo Liu, Xinhao Mei, Qiushi Huang, Jianyuan Sun, Jinzheng Zhao, Haohe Liu, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks.
no code implementations • 13 Oct 2021 • Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
As different people may describe an audio clip from different aspects using distinct words and grammars, we argue that an audio captioning system should have the ability to generate diverse captions for a fixed audio clip and across similar audio clips.
2 code implementations • 19 Sep 2021 • Turab Iqbal, Yin Cao, Andrew Bailey, Mark D. Plumbley, Wenwu Wang
We show that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and we refer to this type of label noise as open-set label noise.
1 code implementation • 5 Aug 2021 • Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang
Automated audio captioning aims to use natural language to describe the content of audio data.
1 code implementation • 21 Jul 2021 • Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is entirely convolution-free.
Ranked #4 on Audio captioning on AudioCaps
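A toy sketch of the convolution-free idea: cut the mel-spectrogram into patches, embed them linearly, and let a standard Transformer encoder-decoder produce caption tokens. All sizes are placeholders, and positional encodings and the causal decoder mask are omitted for brevity.

```python
import torch
import torch.nn as nn

class TinyACT(nn.Module):
    def __init__(self, n_mels=64, patch_t=4, d_model=256, vocab=5000):
        super().__init__()
        self.patch_t = patch_t
        self.patch = nn.Linear(n_mels * patch_t, d_model)  # linear patch embedding
        self.tf = nn.Transformer(d_model, nhead=4, batch_first=True)
        self.tok = nn.Embedding(vocab, d_model)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, mel, tokens):
        # mel: (batch, time, n_mels); tokens: (batch, seq) caption prefix
        b, t, f = mel.shape
        mel = mel[:, : t - t % self.patch_t]            # trim to whole patches
        patches = mel.reshape(b, -1, self.patch_t * f)  # (batch, n_patches, patch_t*f)
        memory = self.patch(patches)                    # encoder input
        return self.out(self.tf(memory, self.tok(tokens)))

logits = TinyACT()(torch.randn(2, 100, 64), torch.randint(0, 5000, (2, 12)))
```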
2 code implementations • 21 Jul 2021 • Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D. Plumbley, Wenwu Wang
Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip.
1 code implementation • 21 Jul 2021 • Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
We evaluate our approach on the UrbanSound8K dataset against SampleRNN, using performance metrics that measure the quality and diversity of the generated sounds.
1 code implementation • 12 Jul 2021 • Annamaria Mesaros, Toni Heittola, Tuomas Virtanen, Mark D. Plumbley
The goal of automatic sound event detection (SED) methods is to recognize what is happening in an audio signal and when it is happening.
2 code implementations • 28 Oct 2020 • Andrew Bailey, Mark D. Plumbley
Depression is a large-scale mental health problem, and its detection is a challenging area for machine learning researchers.
3 code implementations • 25 Oct 2020 • Yin Cao, Turab Iqbal, Qiuqiang Kong, Fengyan An, Wenwu Wang, Mark D. Plumbley
Polyphonic sound event localization and detection (SELD), which jointly performs sound event detection (SED) and direction-of-arrival (DoA) estimation, detects the type and occurrence time of sound events as well as their corresponding DoA angles simultaneously.
Sound • Audio and Speech Processing
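A hedged sketch of the joint output structure: a shared trunk feeds one branch for per-class event activity (SED) and one for per-class direction-of-arrival vectors. The trunk, sizes, and class count are placeholders.

```python
import torch
import torch.nn as nn

class SELDHead(nn.Module):
    def __init__(self, feat_dim: int = 512, n_classes: int = 14):
        super().__init__()
        self.sed = nn.Linear(feat_dim, n_classes)       # event-activity logits
        self.doa = nn.Linear(feat_dim, 3 * n_classes)   # (x, y, z) per class

    def forward(self, feats):
        # feats: (batch, time, feat_dim) from a shared trunk
        activity = torch.sigmoid(self.sed(feats))           # (b, t, c)
        doa = torch.tanh(self.doa(feats))                   # bounded coordinates
        return activity, doa.view(*feats.shape[:2], -1, 3)  # (b, t, c, 3)

act, doa = SELDHead()(torch.randn(2, 50, 512))
```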
2 code implementations • 30 Sep 2020 • Yin Cao, Turab Iqbal, Qiuqiang Kong, Yue Zhong, Wenwu Wang, Mark D. Plumbley
In this paper, a novel event-independent network for polyphonic sound event localization and detection is proposed.
Audio and Speech Processing • Sound
1 code implementation • 11 Feb 2020 • Turab Iqbal, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
The proposed method uses an auxiliary classifier, trained on data that is known to be in-distribution, for detection and relabelling.
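A minimal sketch of this detect-and-relabel step under simple assumptions: keep and relabel examples the auxiliary in-distribution classifier predicts with high confidence, and drop the rest as likely out-of-distribution. The threshold and classifier are illustrative.

```python
import torch

@torch.no_grad()
def relabel(classifier, x, threshold: float = 0.9):
    probs = torch.softmax(classifier(x), dim=1)
    conf, pred = probs.max(dim=1)
    keep = conf >= threshold       # confident -> treated as in-distribution
    return x[keep], pred[keep]     # relabelled subset and its new labels

aux = torch.nn.Linear(20, 5)       # stand-in for the trained auxiliary classifier
clean_x, new_y = relabel(aux, torch.randn(100, 20))
```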
1 code implementation • 21 Oct 2019 • Emad M. Grais, Fei Zhao, Mark D. Plumbley
In the spectrogram of a mixture of singing voices and music signals, there is usually more information about the voice in the low frequency bands than the high frequency bands.
1 code implementation • 14 Jun 2019 • Qiuqiang Kong, Yong Xu, Wenwu Wang, Philip J. B. Jackson, Mark D. Plumbley
Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture and is a challenging problem in which no prior knowledge of the mixing filters is available.
1 code implementation • 1 May 2019 • Yin Cao, Qiuqiang Kong, Turab Iqbal, Fengyan An, Wenwu Wang, Mark D. Plumbley
In this paper, it is experimentally shown that the training information of SED can contribute to direction-of-arrival estimation (DOAE).
Sound • Audio and Speech Processing
no code implementations • 1 Nov 2018 • Emad M. Grais, Hagen Wierstorf, Dominic Ward, Russell Mason, Mark D. Plumbley
Current performance evaluation for audio source separation depends on comparing the processed or separated signals with reference signals.
2 code implementations • 12 Apr 2018 • Qiuqiang Kong, Yong Xu, Iwona Sobieraj, Wenwu Wang, Mark D. Plumbley
Sound event detection (SED) aims to detect when and recognize what sound events happen in an audio clip.
Sound • Audio and Speech Processing
no code implementations • 2 Mar 2018 • Emad M. Grais, Dominic Ward, Mark D. Plumbley
Supervised multi-channel audio source separation requires extracting useful spectral, temporal, and spatial features from the mixed signals.
2 code implementations • 8 Nov 2017 • Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley
First, we propose a separation mapping from the time-frequency (T-F) representation of an audio to the T-F segmentation masks of the audio events.
Sound • Audio and Speech Processing
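A hedged sketch of the separation mapping described above: a network maps the mixture's T-F representation to one mask per event class, and each mask multiplies the spectrogram to isolate its event. The mask network here is a placeholder MLP, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EventMasker(nn.Module):
    def __init__(self, n_freq: int = 513, n_events: int = 10):
        super().__init__()
        self.n_events = n_events
        self.net = nn.Sequential(
            nn.Linear(n_freq, 512), nn.ReLU(),
            nn.Linear(512, n_freq * n_events), nn.Sigmoid(),  # one mask per event
        )

    def forward(self, spec):
        # spec: (batch, time, n_freq) magnitude spectrogram of the mixture
        masks = self.net(spec).view(*spec.shape[:2], self.n_events, -1)
        return spec.unsqueeze(2) * masks   # (batch, time, n_events, n_freq)

sources = EventMasker()(torch.rand(2, 100, 513))
```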
5 code implementations • 2 Nov 2017 • Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley
Then the classification of a bag is the expectation of the classification output of the instances in the bag with respect to the learned probability measure.
Sound • Audio and Speech Processing
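A sketch of that expectation under a learned probability measure, in the spirit of attention-based MIL pooling: attention weights that sum to one over the instances of a bag average the instance-level outputs into a bag-level prediction. Sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim: int = 128, n_classes: int = 10):
        super().__init__()
        self.cls = nn.Linear(feat_dim, n_classes)   # instance classifier
        self.att = nn.Linear(feat_dim, n_classes)   # learned measure (unnormalized)

    def forward(self, instances):
        # instances: (batch, n_instances, feat_dim), e.g. frames of a clip
        p = torch.sigmoid(self.cls(instances))          # instance probabilities
        q = torch.softmax(self.att(instances), dim=1)   # sums to 1 over instances
        return (q * p).sum(dim=1)                       # bag-level expectation

bag_prob = AttentionMIL()(torch.randn(4, 240, 128))     # (4, 10)
```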
no code implementations • 28 Oct 2017 • Emad M. Grais, Hagen Wierstorf, Dominic Ward, Mark D. Plumbley
In deep neural networks with convolutional layers, each layer typically has a fixed-size, single-resolution receptive field (RF).
3 code implementations • 1 Oct 2017 • Yong Xu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge.
Sound • Audio and Speech Processing
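A minimal sketch of the gating mechanism named above: a gated convolution in which one convolution's output is multiplied by the sigmoid of another's, acting as a learnable attention gate over the feature map. Channel counts and input shape are illustrative.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (batch, channels, time, freq) log-mel input or feature map
        return self.conv(x) * torch.sigmoid(self.gate(x))  # GLU-style gating

y = GatedConvBlock(1, 64)(torch.randn(2, 1, 320, 64))
```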
4 code implementations • 23 Mar 2017 • Emad M. Grais, Mark D. Plumbley
Each CDAE is trained to separate one source and treats the other sources as background noise.
Sound • 68T01 • H.5.5; I.5; I.2.6; I.4.3
1 code implementation • 17 Mar 2017 • Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang, Mark D. Plumbley
Audio tagging aims to perform multi-label classification on audio chunks; it is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge.
Sound
2 code implementations • 24 Feb 2017 • Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang, Mark D. Plumbley
In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging.
no code implementations • 15 Jul 2016 • Siddharth Sigtia, Adam M. Stark, Sacha Krstulovic, Mark D. Plumbley
In the context of the Internet of Things (IoT), sound sensing applications are required to run on embedded platforms where notions of product pricing and form factor impose hard constraints on the available computing power.
2 code implementations • 13 Jul 2016 • Yong Xu, Qiang Huang, Wenwu Wang, Peter Foster, Siddharth Sigtia, Philip J. B. Jackson, Mark D. Plumbley
For the unsupervised feature learning, we propose to use a symmetric or asymmetric deep de-noising auto-encoder (sDAE or aDAE) to generate new data-driven features from the Mel-Filter Banks (MFBs) features.
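A hedged sketch of a symmetric denoising auto-encoder on MFB frames: corrupt the input with Gaussian noise, train to reconstruct the clean frame, and take the bottleneck code as the learned feature; an asymmetric variant would use different encoder and decoder widths. All sizes and the noise level are illustrative.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, in_dim: int = 40, hidden: int = 256, code: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, code))
        self.dec = nn.Sequential(nn.Linear(code, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x, noise_std: float = 0.1):
        noisy = x + noise_std * torch.randn_like(x)   # corrupt the input
        code = self.enc(noisy)
        return self.dec(code), code                   # reconstruction + feature

model = DenoisingAE()
x = torch.randn(32, 40)                  # a batch of MFB frames
recon, feat = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruct the *clean* input
```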
no code implementations • 13 Jul 2016 • Yong Xu, Qiang Huang, Wenwu Wang, Mark D. Plumbley
In this paper, we present a deep neural network (DNN)-based acoustic scene classification framework.
no code implementations • 24 Jun 2016 • Yong Xu, Qiang Huang, Wenwu Wang, Philip J. B. Jackson, Mark D. Plumbley
Compared with the conventional Gaussian Mixture Model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method can better utilize long-term temporal information by taking the whole chunk as input.
1 code implementation • 17 Apr 2015 • Andrew J. R. Simpson, Gerard Roma, Mark D. Plumbley
Identification and extraction of singing voice from within musical mixtures is a key challenge in source separation and machine audition.
no code implementations • 13 Nov 2014 • Daniele Barchiesi, Dimitrios Giannoulis, Dan Stowell, Mark D. Plumbley
We then describe a range of different algorithms submitted for a data challenge that was held to provide a general and fair benchmark for ASC techniques.
no code implementations • 26 May 2014 • Dan Stowell, Mark D. Plumbley
Feature learning can be performed at large scale and "unsupervised", meaning it requires no manual data labelling, yet it can improve performance on "supervised" tasks such as classification.