no code implementations • 5 Sep 2024 • Xi Chen, Haosen Yang, Sheng Jin, Xiatian Zhu, Hongxun Yao
To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models and focus optimization solely on a lightweight transformer decoder for mask proposal generation, which is the performance bottleneck.
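A minimal PyTorch sketch of this training setup, assuming generic interfaces: the two placeholder backbones, the query count, and the decoder hyperparameters are illustrative stand-ins, not details from the paper. Only the lightweight transformer decoder's parameters reach the optimizer.

```python
import torch
import torch.nn as nn

# Sketch: two frozen foundation backbones feed a small trainable
# transformer decoder that emits mask-proposal queries.
class MaskProposalDecoder(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_layers=3):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_head = nn.Linear(dim, dim)  # dot-producted with pixel features

    def forward(self, pixel_feats):             # pixel_feats: (B, HW, dim)
        q = self.queries.weight.unsqueeze(0).expand(pixel_feats.size(0), -1, -1)
        q = self.decoder(q, pixel_feats)         # refine queries against pixel features
        # mask logits per query: (B, num_queries, HW)
        return torch.einsum("bqd,bpd->bqp", self.mask_head(q), pixel_feats)

def freeze(module: nn.Module):
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

# hypothetical placeholders standing in for the two frozen foundation models
vision_backbone = nn.Sequential(nn.Conv2d(3, 256, 16, 16))
text_backbone = nn.Embedding(1000, 256)
freeze(vision_backbone)
freeze(text_backbone)

decoder = MaskProposalDecoder()
# only the decoder is optimized; the foundation models stay frozen
optim = torch.optim.AdamW(decoder.parameters(), lr=1e-4)
```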
no code implementations • 13 Jun 2024 • Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu
To obtain a material-aware and geometry-aware condition for audio synthesis, we learn an explicit point-based scene representation with an audio-guidance parameter on locally initialized Gaussian points, taking into account the spatial relation between the listener and the sound source.
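A minimal sketch of this idea under stated assumptions: each Gaussian point carries a learnable audio-guidance vector, and the audio condition pools those vectors with weights derived from each point's distance to the listener and the sound source. The parameter names, pooling rule, and dimensions are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class AudioGuidedPoints(nn.Module):
    def __init__(self, num_points=10000, guide_dim=32):
        super().__init__()
        self.xyz = nn.Parameter(torch.randn(num_points, 3))          # point positions
        self.audio_guide = nn.Parameter(torch.zeros(num_points, guide_dim))

    def condition(self, listener: torch.Tensor, source: torch.Tensor):
        # distance-based weights encode the spatial relation to listener and source
        d_listener = torch.cdist(self.xyz, listener[None]).squeeze(-1)  # (N,)
        d_source = torch.cdist(self.xyz, source[None]).squeeze(-1)      # (N,)
        w = torch.softmax(-(d_listener + d_source), dim=0)              # (N,)
        # pooled material/geometry-aware condition for an audio decoder
        return (w[:, None] * self.audio_guide).sum(dim=0)               # (guide_dim,)

points = AudioGuidedPoints()
cond = points.condition(listener=torch.tensor([0.0, 0.0, 1.5]),
                        source=torch.tensor([2.0, 0.0, 1.0]))
print(cond.shape)  # torch.Size([32])
```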
no code implementations • 6 Jun 2024 • Haosen Yang, Chenhao Zhang, Wenqing Wang, Marco Volino, Adrian Hilton, Li Zhang, Xiatian Zhu
To address these limitations, we propose a Localized Point Management (LPM) strategy, capable of identifying the error-contributing zones most in need of both point addition and geometry calibration.
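A hedged sketch of one way such a strategy could be wired up; the error criterion (back-projecting per-pixel rendering error onto points) and the two operations shown (cloning points with jitter, shrinking scales for re-fitting) are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def localized_point_management(xyz, scales, pixel_error, point_to_pixel, top_ratio=0.05):
    # accumulate image-space rendering error per point via its associated pixel
    point_error = pixel_error.flatten()[point_to_pixel]          # (N,)
    k = max(1, int(top_ratio * xyz.shape[0]))
    zone = torch.topk(point_error, k).indices                    # error-contributing zone

    # point addition: clone points inside the zone with a small positional jitter
    new_xyz = xyz[zone] + 0.01 * torch.randn_like(xyz[zone])
    new_scales = scales[zone].clone()

    # geometry calibration: shrink the selected points so optimization can re-fit them
    scales = scales.clone()
    scales[zone] *= 0.5
    return torch.cat([xyz, new_xyz]), torch.cat([scales, new_scales])

# toy usage with random tensors
xyz = torch.randn(1000, 3)
scales = torch.rand(1000, 3)
pixel_error = torch.rand(64, 64)
point_to_pixel = torch.randint(0, 64 * 64, (1000,))
xyz, scales = localized_point_management(xyz, scales, pixel_error, point_to_pixel)
```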
no code implementations • 21 Mar 2024 • Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu
Audio-Visual Segmentation (AVS) aims to identify, at the pixel level, the object in a visual scene that produces a given sound.
1 code implementation • 17 Mar 2024 • Xi Chen, Haosen Yang, Huicong Zhang, Hongxun Yao, Xiatian Zhu
Source-free unsupervised domain adaptation (SFUDA) aims to enable the utilization of a pre-trained source model in an unlabeled target domain without access to source data.
no code implementations • 14 Mar 2024 • Hong Liu, Haosen Yang, Paul J. van Diest, Josien P. W. Pluim, Mitko Veta
In particular, our model outperforms SAM by 4.1 and 2.5 percentage points on the ductal carcinoma in situ (DCIS) segmentation task and the breast cancer metastasis segmentation task (CAMELYON16 dataset), respectively.
1 code implementation • 2 Nov 2023 • Haosen Yang, Chuofan Ma, Bin Wen, Yi Jiang, Zehuan Yuan, Xiatian Zhu
Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities by either training a contrastive model from scratch with an extensive collection of region-label pairs or aligning the outputs of a detection model with image-level representations of region proposals.
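The second strategy mentioned above can be sketched as follows, under clearly stated assumptions: `clip_image_encoder` is a stand-in for a frozen ViL image encoder, and `proj` is an assumed linear projection from detector feature width to the encoder's embedding dimension; the crop size and loss form are illustrative.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def region_alignment_loss(feature_map, boxes, clip_image_encoder, image, proj):
    # detector-side region features pooled from the backbone feature map
    region_feats = roi_align(feature_map, [boxes], output_size=(7, 7), spatial_scale=1 / 16)
    region_feats = proj(region_feats.mean(dim=(2, 3)))             # (R, D)

    # teacher-side embeddings: crop each region proposal from the image and encode it
    crops = [F.interpolate(image[:, :, int(y1):int(y2), int(x1):int(x2)],
                           size=(224, 224), mode="bilinear", align_corners=False)
             for x1, y1, x2, y2 in boxes.tolist()]
    with torch.no_grad():
        clip_feats = clip_image_encoder(torch.cat(crops))           # (R, D), frozen

    # pull the detector's region features toward the frozen ViL embeddings
    return 1 - F.cosine_similarity(region_feats, clip_feats, dim=-1).mean()
```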
no code implementations • 13 Sep 2023 • Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Xiatian Zhu
In particular, in situations where existing supervised AVS methods struggle with overlapping foreground objects, our models still accurately segment the overlapping auditory objects.
1 code implementation • 9 Oct 2022 • Haosen Yang, Deng Huang, Bin Wen, Jiannan Wu, Hongxun Yao, Yi Jiang, Xiatian Zhu, Zehuan Yuan
As a result, our model can spontaneously and effectively extract both static appearance and dynamic motion, leading to superior spatiotemporal representation learning capability.
no code implementations • 21 Jul 2022 • Boyang Xia, Wenhao Wu, Haoran Wang, Rui Su, Dongliang He, Haosen Yang, Xiaoran Fan, Wanli Ouyang
On the video level, a temporal attention module is learned under dual video-level supervision on both the salient and the non-salient representations (a sketch follows this entry).
Ranked #4 on Action Recognition on ActivityNet
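A minimal sketch of such a dual-supervised temporal attention module, with assumed design choices: the complementary (negated-attention) pooling for the non-salient branch, the feature dimension, and the shared classifier are illustrative, not confirmed details.

```python
import torch
import torch.nn as nn

class DualSupervisedTemporalAttention(nn.Module):
    def __init__(self, dim=2048, num_classes=200):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, 256), nn.Tanh(), nn.Linear(256, 1))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):                               # (B, T, dim)
        w = torch.softmax(self.attn(frame_feats), dim=1)          # (B, T, 1)
        salient = (w * frame_feats).sum(dim=1)                    # attention-weighted pooling
        w_neg = torch.softmax(-self.attn(frame_feats), dim=1)     # complementary weights
        non_salient = (w_neg * frame_feats).sum(dim=1)
        return self.classifier(salient), self.classifier(non_salient)

model = DualSupervisedTemporalAttention()
logits_sal, logits_non = model(torch.randn(2, 16, 2048))
labels = torch.tensor([3, 7])
ce = nn.CrossEntropyLoss()
# dual video-level supervision: both branches see the same video label
loss = ce(logits_sal, labels) + ce(logits_non, labels)
```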
1 code implementation • 15 Dec 2021 • Haosen Yang, Wenhao Wu, Lining Wang, Sheng Jin, Boyang Xia, Hongxun Yao, Hujie Huang
To evaluate the confidence of proposals, existing works typically predict an action score for each proposal, supervised by the temporal Intersection-over-Union (tIoU) between the proposal and the ground truth.
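For reference, the tIoU supervision target can be computed as below; this is a generic sketch of the standard tIoU formulation, with the max-over-ground-truth assignment shown as one common choice rather than the paper's specific scheme.

```python
import torch

def temporal_iou(proposals: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """tIoU between proposals (P, 2) and ground-truth segments (G, 2), each as [start, end]."""
    inter_start = torch.maximum(proposals[:, None, 0], gt[None, :, 0])
    inter_end = torch.minimum(proposals[:, None, 1], gt[None, :, 1])
    inter = (inter_end - inter_start).clamp(min=0)
    union = ((proposals[:, 1] - proposals[:, 0])[:, None]
             + (gt[:, 1] - gt[:, 0])[None, :] - inter)
    return inter / union.clamp(min=1e-6)                          # (P, G)

# each proposal's confidence target is its best tIoU against any ground-truth action
proposals = torch.tensor([[2.0, 8.0], [10.0, 15.0]])
gt = torch.tensor([[3.0, 9.0], [11.0, 14.0]])
target = temporal_iou(proposals, gt).max(dim=1).values            # supervises the score head
```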
no code implementations • 25 May 2021 • Lining Wang, Haosen Yang, Wenhao Wu, Hongxun Yao, Hujie Huang
Conventionally, the temporal action proposal generation (TAPG) task is divided into two main sub-tasks: boundary prediction and proposal confidence prediction, which rely on frame-level dependencies and proposal-level relationships, respectively.
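A minimal sketch of that conventional two-branch decomposition, with assumed interfaces: a frame-level branch outputs per-frame start/end boundary probabilities, and a proposal-level branch scores candidate (start, end) pairs from pooled features. Layer shapes and the feature-indexing scheme are illustrative only.

```python
import torch
import torch.nn as nn

class TwoBranchTAPG(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.boundary = nn.Conv1d(dim, 2, kernel_size=3, padding=1)   # start/end prob. per frame
        self.confidence = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, feats, proposals):        # feats: (B, dim, T); proposals: (P, 2) frame indices
        boundary_probs = torch.sigmoid(self.boundary(feats))           # (B, 2, T)
        start_feat = feats[0, :, proposals[:, 0]].t()                  # (P, dim)
        end_feat = feats[0, :, proposals[:, 1]].t()                    # (P, dim)
        conf = torch.sigmoid(self.confidence(torch.cat([start_feat, end_feat], dim=1)))
        return boundary_probs, conf.squeeze(-1)

model = TwoBranchTAPG()
feats = torch.randn(1, 256, 100)
proposals = torch.tensor([[10, 40], [50, 90]])
boundary_probs, conf = model(feats, proposals)
```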