no code implementations • 23 Apr 2024 • Haozhe Cheng, Cheng Ju, Haicheng Wang, Jinxiang Liu, Mengting Chen, Qiang Hu, Xiaoyun Zhang, Yanfeng Wang
The denoised text classes help OVAR models classify visual samples more accurately; in return, the classified visual samples enable better denoising of the text classes.
no code implementations • 17 Mar 2024 • Jinxiang Liu, Yikun Liu, Fei Zhang, Chen Ju, Ya Zhang, Yanfeng Wang
Neighboring frames (NFs), temporally adjacent to the labeled frame, often contain rich motion information that assists in the accurate localization of sounding objects.
no code implementations • 25 Jul 2023 • Jinxiang Liu, Chen Ju, Chaofan Ma, Yanfeng Wang, Yu Wang, Ya Zhang
The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in the video frames using audio cues.
no code implementations • 18 May 2023 • Jinxiang Liu, Yu Wang, Chen Ju, Chaofan Ma, Ya Zhang, Weidi Xie
The objective of Audio-Visual Segmentation (AVS) is to localise the sounding objects within visual scenes by accurately predicting pixel-wise segmentation masks.
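Predicting a pixel-wise mask conditioned on audio can be sketched very simply: modulate per-pixel visual features by an audio embedding, then apply a linear mask head. This is an illustrative toy sketch, not the paper's architecture; all function names, shapes, and the multiplicative fusion are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def audio_conditioned_mask(vis_feat, aud_emb, w):
    """Toy sketch of audio-conditioned mask prediction (hypothetical,
    not the paper's method): modulate per-pixel visual features by an
    audio embedding, then apply a linear per-pixel mask head."""
    # vis_feat: (H, W, D) visual features; aud_emb: (D,); w: (D,) mask head
    fused = vis_feat * aud_emb           # audio-conditioned features, (H, W, D)
    logits = fused @ w                   # per-pixel logits, (H, W)
    return sigmoid(logits)               # soft segmentation mask in (0, 1)

rng = np.random.default_rng(0)
mask = audio_conditioned_mask(rng.normal(size=(16, 16, 64)),
                              rng.normal(size=64),
                              rng.normal(size=64))
print(mask.shape)  # (16, 16)
```

A real AVS model would replace the random features with CNN/transformer backbones and train the head with mask supervision; the sketch only shows how the audio signal conditions the per-pixel prediction.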
no code implementations • 17 Mar 2023 • Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Jinxiang Liu, Yu Wang, Ya Zhang, Yanfeng Wang
However, challenges arise from a structural difference between generative and discriminative models, which limits their direct use.
no code implementations • 20 Feb 2023 • Chen Ju, Haicheng Wang, Jinxiang Liu, Chaofan Ma, Ya Zhang, Peisen Zhao, Jianlong Chang, Qi Tian
Temporal sentence grounding aims to detect, in untrimmed videos, the timestamps of the event described by a natural language query.
no code implementations • CVPR 2023 • Chen Ju, Kunhao Zheng, Jinxiang Liu, Peisen Zhao, Ya Zhang, Jianlong Chang, Yanfeng Wang, Qi Tian
As a result, the complementary strengths of the two branches are effectively fused into a strong alliance.
Weakly-supervised Temporal Action Localization
no code implementations • 26 Jun 2022 • Jinxiang Liu, Chen Ju, Weidi Xie, Ya Zhang
We present a simple yet effective self-supervised framework for audio-visual representation learning, to localize the sound source in videos.
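A common building block in self-supervised sound-source localization is audio-visual correspondence: cosine similarity between an audio embedding and each spatial visual feature yields a heatmap over the frame. The sketch below shows only that similarity step, with illustrative shapes and random features standing in for learned encoders; it is an assumption-laden toy, not the paper's framework.

```python
import numpy as np

def localization_map(vis_feat, aud_emb):
    """Sketch of audio-visual correspondence (illustrative shapes):
    cosine similarity between an audio embedding and each spatial
    visual feature gives a heatmap over likely sound-source locations."""
    # vis_feat: (H, W, D) spatial visual features; aud_emb: (D,) audio embedding
    v = vis_feat / (np.linalg.norm(vis_feat, axis=-1, keepdims=True) + 1e-8)
    a = aud_emb / (np.linalg.norm(aud_emb) + 1e-8)
    return v @ a  # (H, W) cosine-similarity heatmap in [-1, 1]

rng = np.random.default_rng(0)
heat = localization_map(rng.normal(size=(7, 7, 32)), rng.normal(size=32))
print(heat.shape)  # (7, 7)
```

In a self-supervised setup, such similarity maps from matched audio-video pairs are pushed higher than those from mismatched pairs (a contrastive objective), so no localization labels are needed.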
no code implementations • 24 Sep 2021 • Jinxiang Liu, Yangheng Zhao, Siheng Chen, Ya Zhang
To leverage the human body shape prior, LPNet exploits the topological information of the body mesh to learn an expressive visual representation for the target person in the 3D mesh space.