2 code implementations • 27 Apr 2024 • Xiao Wang, Qian Zhu, Jiandong Jin, Jun Zhu, Futian Wang, Bo Jiang, YaoWei Wang, Yonghong Tian
Specifically, we formulate video-based PAR as a vision-language fusion problem and adopt the pre-trained foundation model CLIP to extract the visual features.
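The excerpt only names the idea; a minimal sketch of how video-based PAR could be cast as vision-language fusion follows. The encoders, temporal average pooling, and cosine scoring here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def video_par_scores(frames, attribute_embeds, visual_encoder):
    """Illustrative vision-language fusion for video-based PAR:
    encode each frame (visual_encoder stands in for CLIP's image tower),
    average over time, then score each attribute embedding by cosine
    similarity. A simplification, not the paper's fusion module."""
    vis = np.stack([visual_encoder(f) for f in frames]).mean(axis=0)  # temporal pooling
    vis = vis / np.linalg.norm(vis)
    txt = np.stack(attribute_embeds)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return txt @ vis  # one similarity score per pedestrian attribute
```

In practice the frame features and attribute embeddings would come from CLIP's image and text encoders (e.g. via prompts like "a pedestrian wearing a hat"); here both are treated as plain feature vectors to keep the sketch self-contained.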
1 code implementation • 22 Dec 2023 • Lei Liu, Chenglong Li, Futian Wang, Longfeng Shen, Jin Tang
In particular, we design a multi-modal prototype that represents target information with multiple kinds of samples: a fixed sample from the first frame and two representative samples, one from each modality.
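The prototype described above can be sketched as a small container of feature vectors. The class name, the dictionary layout, and the max-cosine matching rule below are assumptions for illustration only:

```python
import numpy as np

class MultiModalPrototype:
    """Hypothetical sketch of a multi-kind prototype: a fixed feature
    vector from the first frame plus one representative sample per
    modality (e.g. RGB and thermal)."""

    def __init__(self, first_frame_feat):
        self.samples = {"fixed": first_frame_feat}  # never replaced

    def update(self, modality, feat):
        # keep the latest representative sample for this modality
        self.samples[modality] = feat

    def score(self, candidate):
        # target similarity = best cosine match over all prototype samples
        c = candidate / np.linalg.norm(candidate)
        return max(float((s / np.linalg.norm(s)) @ c)
                   for s in self.samples.values())
```

Keeping the first-frame sample fixed anchors the prototype against drift, while the per-modality samples let it adapt to appearance changes in each modality, which is the intuition the abstract states.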