Audio-visual video parsing is the task of categorizing a video at the segment level with weak labels, predicting the events in each segment as audible or visible.
Despite recent progress in text-to-audio (TTA) generation, we show that state-of-the-art models such as AudioLDM, trained on datasets with an imbalanced class distribution such as AudioCaps, are biased in their generation performance.
Ranked #1 on Audio Generation on AudioCaps
Any audio can be translated into the "language of audio" (LOA), a general audio representation, based on AudioMAE, a self-supervised pre-trained representation-learning model (see the sketch below).
Ranked #2 on Audio Generation on AudioCaps
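The snippet above compresses the idea: a frozen, self-supervised AudioMAE-style encoder turns any audio into a sequence of semantic tokens (the LOA) that a generator can then target. A minimal sketch of such an extractor, with a shallow hypothetical stand-in for the real encoder:

```python
import torch
import torch.nn as nn

class AudioMAEEncoder(nn.Module):
    """Hypothetical stand-in for a frozen AudioMAE-style encoder; the real
    model is a much deeper masked-autoencoder ViT trained on audio."""
    def __init__(self, n_mels=128, patch=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)  # patchify
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)  # shallow here for brevity

    @torch.no_grad()  # frozen: LOA extraction does not update the encoder
    def forward(self, mel):               # mel: (batch, 1, n_mels, frames)
        x = self.proj(mel)                # (batch, dim, H', W')
        x = x.flatten(2).transpose(1, 2)  # (batch, tokens, dim)
        return self.blocks(x)             # sequence of LOA-like tokens

mel = torch.randn(2, 1, 128, 1024)        # dummy mel-spectrogram batch
loa = AudioMAEEncoder()(mel)
print(loa.shape)                          # torch.Size([2, 512, 768])
```

The no-grad decoration reflects the design choice implied above: the LOA extractor stays fixed, and only the downstream generative model is trained against its tokens.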
In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries.
Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text.
We have observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model.
Automatically describing audio-visual content with texts, namely video captioning, has received significant attention due to its potential applications across diverse fields.
Experimental results show that LHDFF outperforms existing audio captioning models.
We present the Treff adapter, a training-efficient adapter for CLAP that boosts zero-shot classification performance using a small set of labelled data.
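The adapter's internals are not spelled out in the snippet; the sketch below illustrates one cache-based way (in the spirit of Tip-Adapter) to blend CLAP zero-shot scores with a handful of labelled clips. The function name, blending rule, and hyperparameters are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def cached_zero_shot_logits(audio_emb, text_emb, cache_keys, cache_labels,
                            alpha=0.5, beta=5.0):
    """Blend CLAP zero-shot logits with a label cache built from a few
    labelled clips. All tensors are assumed L2-normalized CLAP embeddings.

    audio_emb:    (batch, dim)       query clips
    text_emb:     (classes, dim)     class-prompt embeddings
    cache_keys:   (shots, dim)       embeddings of the labelled clips
    cache_labels: (shots, classes)   their one-hot labels
    """
    zero_shot = audio_emb @ text_emb.t()                 # standard CLAP scores
    affinity = torch.exp(-beta * (1 - audio_emb @ cache_keys.t()))
    cache_logits = affinity @ cache_labels               # similarity-weighted votes
    return zero_shot + alpha * cache_logits

# toy usage with random embeddings standing in for real CLAP outputs
dim, n_cls, shots = 512, 10, 40
a = F.normalize(torch.randn(4, dim), dim=-1)
t = F.normalize(torch.randn(n_cls, dim), dim=-1)
k = F.normalize(torch.randn(shots, dim), dim=-1)
y = F.one_hot(torch.randint(0, n_cls, (shots,)), n_cls).float()
print(cached_zero_shot_logits(a, t, k, y).shape)         # torch.Size([4, 10])
```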
no code implementations • Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen
Furthermore, when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours).
Deep neural networks have recently achieved breakthroughs in sound generation with text prompts.
By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency.
Ranked #6 on Audio Generation on AudioCaps
Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips).
The proposed metric, ontology-aware mean average precision (OmAP), addresses the weaknesses of mAP by utilizing the AudioSet ontology information during evaluation.
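The snippet does not define OmAP itself; as one plausible illustration of how ontology information can enter an evaluation, the sketch below computes average precision after collapsing leaf classes to coarser AudioSet-style nodes. The grouping rule is an assumption for illustration, not the published formula:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def coarse_map(scores, labels, parent):
    """mAP after collapsing leaf classes to coarser ontology nodes.

    scores, labels: (clips, leaf_classes) arrays
    parent: list mapping each leaf class index to a coarse node index
    One plausible ingredient of an ontology-aware metric: max-pool scores
    and OR the labels within each coarse node, then average AP over nodes.
    """
    parent = np.asarray(parent)
    aps = []
    for node in np.unique(parent):
        cols = np.where(parent == node)[0]
        s = scores[:, cols].max(axis=1)   # best leaf score per clip
        y = labels[:, cols].max(axis=1)   # positive if any leaf is positive
        if y.any():                       # AP is undefined without positives
            aps.append(average_precision_score(y, s))
    return float(np.mean(aps))

# toy example: 6 leaf classes grouped into 3 coarse ontology nodes
rng = np.random.default_rng(0)
scores = rng.random((100, 6))
labels = (rng.random((100, 6)) < 0.2).astype(int)
print(coarse_map(scores, labels, parent=[0, 0, 1, 1, 2, 2]))
```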
1 code implementation • 28 Oct 2022 • Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
Audio captioning aims to generate text descriptions of audio clips.
Persona-based dialogue systems aim to generate consistent responses based on historical context and predefined persona.
Moreover, in LHDFF, a new PANNs encoder called Residual PANNs (RPANNs) is proposed, which fuses the low-dimensional feature from an intermediate convolutional layer with the high-dimensional feature from the final layer output of PANNs.
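A minimal sketch of that residual fusion, with illustrative feature sizes rather than the published RPANNs configuration:

```python
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    """Fuse a low-dimensional intermediate CNN feature with the
    high-dimensional final feature, as in the RPANNs idea above.
    The 512/2048 sizes are illustrative, not the published ones."""
    def __init__(self, low_dim=512, high_dim=2048):
        super().__init__()
        self.up = nn.Linear(low_dim, high_dim)  # match dimensions first

    def forward(self, low_feat, high_feat):
        # residual-style fusion: final feature plus projected intermediate one
        return high_feat + self.up(low_feat)

low = torch.randn(8, 512)     # intermediate convolutional block output
high = torch.randn(8, 2048)   # final PANNs embedding
print(ResidualFusion()(low, high).shape)  # torch.Size([8, 2048])
```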
Starting from a high-temporal-resolution spectrogram, such as one with a one-millisecond hop size, we show that DiffRes can improve classification accuracy with the same computational complexity.
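DiffRes itself is more involved, but the intuition of learning temporal resolution can be sketched as importance-weighted merging of adjacent frames; everything below is a simplification for illustration, not the DiffRes algorithm:

```python
import torch
import torch.nn as nn

class LearnedTemporalPooling(nn.Module):
    """Intuition-level simplification: a tiny network scores each
    spectrogram frame, and frames within a window are merged by
    softmax-weighted averaging, so salient frames dominate the
    lower-resolution output."""
    def __init__(self, n_mels=128, window=4):
        super().__init__()
        self.window = window
        self.score = nn.Linear(n_mels, 1)   # per-frame importance

    def forward(self, spec):                # spec: (batch, frames, n_mels)
        b, t, f = spec.shape
        w = self.score(spec).view(b, t // self.window, self.window, 1)
        w = w.softmax(dim=2)                # weights within each window
        x = spec.view(b, t // self.window, self.window, f)
        return (w * x).sum(dim=2)           # (batch, frames/window, n_mels)

spec = torch.randn(2, 1000, 128)            # 1 ms hop -> 1000 frames/second
print(LearnedTemporalPooling()(spec).shape) # torch.Size([2, 250, 128])
```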
Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios.
This technical report describes the SurreyAudioTeam22's submission for DCASE 2022 ASC Task 1, Low-Complexity Acoustic Scene Classification (ASC).
In addition, we use transductive inference on the validation set during training for better adaptation to novel classes.
Experimental results on the DCASE 2019 Task 1 and ESC-50 datasets show that our proposed method outperforms baseline continual learning methods in classification accuracy and computational efficiency, indicating that it can efficiently and incrementally learn new classes without catastrophic forgetting for on-device environmental sound classification.
In this paper, we present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets.
Speech restoration aims to remove distortions in speech signals.
We present an extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets.
In this paper, we propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios (see the sketch below).
Ranked #2 on Audio Super-Resolution on VCTK Multi-Speaker
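At a high level, the vocoder-based recipe is analysis, mel bandwidth extension, and re-synthesis with a neural vocoder. In the sketch below, the mel frontend and vocoder are dummy placeholders (a real system would use, e.g., a HiFi-GAN-style vocoder), and the extender is an illustrative stand-in:

```python
import torch
import torch.nn as nn

class MelBandwidthExtender(nn.Module):
    """Placeholder mel-to-mel network: predicts full-band mel bins from a
    band-limited input. Sizes are illustrative, not the NVSR ones."""
    def __init__(self, n_mels=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                 nn.Linear(256, n_mels))

    def forward(self, mel):        # mel: (batch, frames, n_mels)
        return self.net(mel)

def super_resolve(lr_wave, mel_fn, extender, vocoder):
    """Vocoder-based super-resolution pipeline, sketched at a high level:
    analyse -> extend mel bandwidth -> re-synthesize."""
    mel_lr = mel_fn(lr_wave)       # (batch, frames, n_mels), band-limited
    mel_hr = extender(mel_lr)      # fill in the missing high bands
    return vocoder(mel_hr)         # high-resolution waveform out

mel_fn = lambda w: torch.randn(w.shape[0], 100, 128)  # dummy mel frontend
vocoder = lambda m: torch.randn(m.shape[0], 48000)    # dummy 48 kHz vocoder
out = super_resolve(torch.randn(2, 8000), mel_fn, MelBandwidthExtender(), vocoder)
print(out.shape)                                      # torch.Size([2, 48000])
```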
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing").
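A common way to realize such text-queried separation is to let a text embedding modulate a masking network; the sketch below is a generic illustration of the task setup, not the actual LASS/AudioSep architecture:

```python
import torch
import torch.nn as nn

class TextQueriedMasker(nn.Module):
    """Generic text-conditioned separation head: a query embedding
    (e.g. from a CLAP/BERT-style text encoder) modulates a mask network
    FiLM-style, and the mask is applied to the mixture spectrogram."""
    def __init__(self, n_freq=513, text_dim=512, hidden=256):
        super().__init__()
        self.film = nn.Linear(text_dim, 2 * hidden)  # scale and shift
        self.pre = nn.Linear(n_freq, hidden)
        self.post = nn.Linear(hidden, n_freq)

    def forward(self, mix_mag, text_emb):  # (b, frames, n_freq), (b, text_dim)
        scale, shift = self.film(text_emb).chunk(2, dim=-1)
        h = torch.relu(self.pre(mix_mag))
        h = h * scale.unsqueeze(1) + shift.unsqueeze(1)  # condition on query
        mask = torch.sigmoid(self.post(h))               # mask in [0, 1]
        return mask * mix_mag                            # estimated target

mix = torch.randn(2, 300, 513).abs()   # mixture magnitude spectrogram
query = torch.randn(2, 512)            # "a man tells a joke ..." embedding
print(TextQueriedMasker()(mix, query).shape)  # torch.Size([2, 300, 513])
```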
In this paper, we propose a novel approach for ASC using deep neural decision forest (DNDF).
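A deep neural decision forest replaces the usual softmax classifier head with an ensemble of soft decision trees over learned features. A single-tree sketch, with depth and sizes chosen for illustration:

```python
import torch
import torch.nn as nn

class SoftDecisionTree(nn.Module):
    """One soft tree of a deep neural decision forest: each internal node
    routes with a sigmoid over a learned projection of the input feature,
    and leaves hold class distributions."""
    def __init__(self, feat_dim=128, depth=4, n_classes=10):
        super().__init__()
        self.depth = depth
        n_internal, n_leaves = 2 ** depth - 1, 2 ** depth
        self.decisions = nn.Linear(feat_dim, n_internal)
        self.leaves = nn.Parameter(torch.randn(n_leaves, n_classes))

    def forward(self, x):                     # x: (batch, feat_dim)
        d = torch.sigmoid(self.decisions(x))  # P(go left) at each node
        mu = x.new_ones(x.shape[0], 1)        # path probability to root
        idx = 0
        for level in range(self.depth):
            n = 2 ** level
            dl = d[:, idx:idx + n]            # this level's nodes
            mu = torch.stack([mu * dl, mu * (1 - dl)], dim=2).flatten(1)
            idx += n
        return mu @ self.leaves.softmax(dim=-1)  # mix leaf distributions

feats = torch.randn(8, 128)                   # e.g. CNN scene embeddings
print(SoftDecisionTree()(feats).shape)        # torch.Size([8, 10])
```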
BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks.
As different people may describe an audio clip from different aspects using distinct words and grammars, we argue that an audio captioning system should have the ability to generate diverse captions for a fixed audio clip and across similar audio clips.
1 code implementation • 5 Aug 2021 • Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang
Automated audio captioning aims to use natural language to describe the content of audio data.
We evaluate our approach on the UrbanSound8K dataset against SampleRNN, using performance metrics that measure the quality and diversity of the generated sounds.
Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip.
In this paper, we propose an Audio Captioning Transformer (ACT), a full Transformer network based on an encoder-decoder architecture that is entirely convolution-free (see the sketch below).
Ranked #5 on Audio Captioning on AudioCaps
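A convolution-free captioning model of this kind can be sketched as linear patch embedding of the mel spectrogram plus a standard encoder-decoder Transformer. Vocabulary, sizes, and the omission of positional encodings below are simplifications, not the ACT configuration:

```python
import torch
import torch.nn as nn

class CaptioningTransformer(nn.Module):
    """Convolution-free encoder-decoder in the spirit of ACT: mel patches
    are linearly embedded (no conv), encoded, and a text decoder attends
    to them. Positional encodings are omitted here for brevity."""
    def __init__(self, n_mels=64, patch_t=4, dim=256, vocab=5000):
        super().__init__()
        self.patch_t = patch_t
        self.embed_audio = nn.Linear(n_mels * patch_t, dim)  # linear patchify
        self.embed_text = nn.Embedding(vocab, dim)
        self.transformer = nn.Transformer(d_model=dim, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, mel, tokens):   # mel: (b, frames, n_mels)
        b, t, f = mel.shape
        patches = mel.reshape(b, t // self.patch_t, self.patch_t * f)
        memory = self.embed_audio(patches)
        tgt = self.embed_text(tokens) # (b, len, dim)
        causal = self.transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.transformer(memory, tgt, tgt_mask=causal)
        return self.out(h)            # next-token logits

mel = torch.randn(2, 400, 64)             # ~4 s of log-mel frames
tokens = torch.randint(0, 5000, (2, 12))  # partial caption token ids
print(CaptioningTransformer()(mel, tokens).shape)  # torch.Size([2, 12, 5000])
```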
Punctuation is critical in understanding natural language text.