no code implementations • 21 Apr 2025 • Huadai Liu, Tianyi Luo, Qikai Jiang, Kaicheng Luo, Peiwen Sun, Jialei Wan, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue
To generate spatial audio from 360-degree video, we propose a novel framework OmniAudio, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data.
no code implementations • 14 Oct 2024 • Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo
However, in stereo audio generation, soundscapes often contain complex scenes with multiple objects and directions.
1 code implementation • 30 Aug 2024 • Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue
By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation.
no code implementations • 23 Jul 2024 • Peiwen Sun, Honggang Zhang, Di Hu
For audio priming bias, to enhance audio sensitivity to different intensities and semantics, an audio-specific perception module perceives the latent semantic information and incorporates it into a limited set of queries, namely active queries.
1 code implementation • 16 Jul 2024 • Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu
Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), as an extension of AVS, further pursues semantic understanding of audio-visual scenes.
no code implementations • 15 Jul 2024 • Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu
In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues.
1 code implementation • 15 Jul 2024 • Yaoting Wang, Peiwen Sun, Yuanchao Li, Honggang Zhang, Di Hu
The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues.
1 code implementation • 23 Apr 2024 • Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Wei Xue, Qifeng Liu, Yike Guo
FlashSpeech can generate speech efficiently in one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation.
1 code implementation • 23 Mar 2024 • Nishant Kumar, Ziyan Tao, Jaikirat Singh, Yang Li, Peiwen Sun, Binghui Zhao, Stefan Gumhold
Image fusion typically employs non-invertible neural networks to merge multiple source images into a single fused image.
no code implementations • 12 Dec 2023 • Peiwen Sun, Yifan Zhang, Zishan Liu, Donghao Chen, Honggang Zhang
Vanilla fusion methods still dominate a large share of mainstream audio-visual tasks.
no code implementations • 9 Sep 2022 • Peiwen Sun, Shanshan Zhang, Zishan Liu, Yougen Yuan, Taotao Zhang, Honggang Zhang, Pengfei Hu
It has already been observed that audio-visual embeddings are more robust than unimodal embeddings for person verification.