Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks

Audio-guided video semantic segmentation, which automatically separates foreground objects from the background in a video sequence according to referring audio expressions, is a challenging problem in visual analysis and editing. However, existing referring video semantic segmentation works mainly focus on guidance from text-based referring expressions and do not model the semantic representation of audio-video interactions. In this paper, we consider the problem of audio-guided video semantic segmentation from the viewpoint of end-to-end denoised encoder-decoder network learning. We propose a wavelet-based encoder network to learn cross-modal representations of video content with audio-form queries. Specifically, we adopt multi-head cross-modal attention to explore the potential relations between video and query content. A two-dimensional discrete wavelet transform is employed to decompose the audio-video features, and the high-frequency coefficients are thresholded to filter out noise and outliers. Then, a self-attention-free decoder network is developed to generate the target masks with frequency-domain transforms. Moreover, we maximize the mutual information between the encoded features and the multi-modal features after cross-modal attention to strengthen the audio guidance. In addition, we construct the first large-scale audio-guided video semantic segmentation dataset. Extensive experiments demonstrate the effectiveness of our method.
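
To make the denoising step concrete, below is a minimal sketch of wavelet-based feature denoising using the PyWavelets library: a single-level 2-D DWT decomposes a fused audio-video feature map, the high-frequency sub-bands are soft-thresholded, and the inverse transform reconstructs the feature map. The percentile-based threshold rule and the function name `wavelet_denoise_features` are illustrative assumptions, not the paper's exact quantification scheme.

```python
import numpy as np
import pywt


def wavelet_denoise_features(feat, wavelet="haar", keep_ratio=0.9):
    """Denoise a 2-D feature map via single-level 2-D DWT thresholding.

    Sketch of the idea described in the abstract: decompose the fused
    audio-video features, shrink the high-frequency coefficients (where
    noise and outliers concentrate), then reconstruct.
    """
    # Single-level 2-D discrete wavelet transform:
    # cA = low-frequency approximation, (cH, cV, cD) = high-frequency details.
    cA, (cH, cV, cD) = pywt.dwt2(feat, wavelet)

    # Assumed threshold rule: a magnitude percentile of the detail coefficients.
    details = np.concatenate([cH.ravel(), cV.ravel(), cD.ravel()])
    thr = np.percentile(np.abs(details), keep_ratio * 100)

    # Soft-threshold the high-frequency sub-bands; the low-frequency
    # approximation is left untouched.
    cH, cV, cD = (pywt.threshold(c, thr, mode="soft") for c in (cH, cV, cD))

    # Inverse transform back to the spatial feature domain.
    return pywt.idwt2((cA, (cH, cV, cD)), wavelet)


# Toy usage: a noisy 64x64 fused feature map.
noisy = np.random.randn(64, 64).astype(np.float32)
clean = wavelet_denoise_features(noisy, keep_ratio=0.9)
print(clean.shape)  # (64, 64)
```

In the paper's pipeline this operation would sit between the cross-modal attention and the decoder, so that the decoder generates masks from denoised frequency-domain representations.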
