no code implementations • 18 Mar 2024 • Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, Wenjun Zeng
Humans constantly interact with their surrounding environments.
1 code implementation • 16 Jan 2024 • Yukun Su, Yiwen Cao, Jingliang Deng, Fengyun Rao, Qingyao Wu
A large amount of User Generated Content (UGC) is uploaded to the Internet daily and displayed to people worldwide through the client side (e.g., mobile and PC).
no code implementations • 26 Dec 2023 • Liang Xu, Xintao Lv, Yichao Yan, Xin Jin, Shuwen Wu, Congsheng Xu, Yifan Liu, Yizhou Zhou, Fengyun Rao, Xingdong Sheng, Yunhui Liu, Wenjun Zeng, Xiaokang Yang
We also equip Inter-X with versatile annotations of more than 34K fine-grained human part-level textual descriptions, semantic interaction categories, interaction order, and the relationship and personality of the subjects.
no code implementations • 29 May 2023 • Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
This potential can be harnessed to create synthetic image-text pairs for training captioning models.
1 code implementation • 25 May 2023 • Zhenhua Liu, Feipeng Ma, Tianyi Wang, Fengyun Rao
We propose a Similarity Alignment Model (SAM) for video copy segment matching.
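The segment-matching idea can be illustrated with a minimal sketch (hypothetical frame embeddings; this is not the paper's actual SAM architecture): compute a frame-to-frame cosine similarity matrix between a query video and a reference video, then locate the copied span along the strongest diagonal, whose offset gives the temporal shift of the copy.

```python
import numpy as np

def cosine_similarity_matrix(q, r):
    """Frame-to-frame cosine similarity between two videos' frame embeddings."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    return q @ r.T  # shape: (len(q), len(r))

def best_diagonal_offset(sim):
    """Pick the diagonal with the highest mean similarity -> temporal offset of the copy."""
    nq, nr = sim.shape
    offsets = list(range(-(nq - 1), nr))
    scores = [np.mean(np.diagonal(sim, offset=o)) for o in offsets]
    return offsets[int(np.argmax(scores))]

# Toy example: reference frames 2..5 are copied (with slight noise) into the query.
rng = np.random.default_rng(0)
ref = rng.normal(size=(8, 16))
query = ref[2:6] + 0.01 * rng.normal(size=(4, 16))  # near-duplicate segment
sim = cosine_similarity_matrix(query, ref)
print(best_diagonal_offset(sim))  # offset 2: query frame i matches reference frame i+2
```

A real system would additionally localize start/end frames of the matched span rather than assuming the whole query is copied, but the similarity-matrix view is the common starting point.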
1 code implementation • 21 May 2023 • Tianyi Wang, Feipeng Ma, Zhenhua Liu, Fengyun Rao
With the development of multimedia technology, Video Copy Detection has been a crucial problem for social media platforms.
no code implementations • CVPR 2022 • Zhaoyang Zeng, Yongsheng Luo, Zhenhua Liu, Fengyun Rao, Dian Li, Weidong Guo, Zhen Wen
In this paper, we propose the Tencent-MVSE dataset, which is the first benchmark dataset for the multi-modal video similarity evaluation task.
Automatic Speech Recognition (ASR) +3
1 code implementation • 9 Dec 2021 • Lu Qi, Jason Kuen, Zhe Lin, Jiuxiang Gu, Fengyun Rao, Dian Li, Weidong Guo, Zhen Wen, Ming-Hsuan Yang, Jiaya Jia
To improve instance-level detection/segmentation performance, existing self-supervised and semi-supervised methods extract either task-unrelated or task-specific training signals from unlabeled data.
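As an illustration of the kind of training signal such methods mine from unlabeled data, here is a minimal, hypothetical pseudo-labeling sketch (confidence thresholding of a model's own predictions); it is a generic semi-supervised technique, not the specific method proposed in the paper.

```python
import numpy as np

def pseudo_labels(class_probs, threshold=0.9):
    """Keep only predictions the model is confident about as pseudo-labels.

    class_probs: (n_unlabeled, n_classes) softmax outputs on unlabeled data.
    Returns (indices, labels) for the examples passing the confidence threshold.
    """
    confidence = class_probs.max(axis=1)
    keep = confidence >= threshold
    return np.flatnonzero(keep), class_probs[keep].argmax(axis=1)

# Toy unlabeled batch: two confident predictions, one ambiguous one.
probs = np.array([
    [0.95, 0.03, 0.02],   # confident class 0 -> kept
    [0.40, 0.35, 0.25],   # ambiguous -> discarded
    [0.05, 0.02, 0.93],   # confident class 2 -> kept
])
idx, labels = pseudo_labels(probs)
print(idx, labels)  # [0 2] [0 2]
```

The kept (index, label) pairs would then be mixed into the labeled set for another round of training; the threshold trades pseudo-label quantity against noise.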
no code implementations • 13 Oct 2021 • Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, Xiu Li
Notably, our model is trained only on the MSR-VTT dataset.
no code implementations • 11 Oct 2021 • Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Fengyun Rao, Dian Li
In the proposed CLIP4Caption++, we employ the advanced encoder-decoder architecture X-Transformer as our main framework and make the following improvements: 1) we utilize three strong pre-trained CLIP models to extract text-related appearance visual features.
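The appearance-feature step can be sketched roughly as follows. `encode_frame` is a hypothetical stand-in for a real pre-trained CLIP image encoder; the sketch only shows the surrounding plumbing assumed here: sample frames uniformly, encode each one, and L2-normalize the embeddings before handing them to the downstream captioning model.

```python
import numpy as np

def encode_frame(frame):
    """Hypothetical stub for a pre-trained CLIP image encoder.

    A real pipeline would return the CLIP visual embedding of the frame;
    here we derive a deterministic pseudo-embedding from the pixel bytes.
    """
    rng = np.random.default_rng(abs(hash(frame.tobytes())) % (2**32))
    return rng.normal(size=512)

def clip_appearance_features(video, num_samples=8):
    """Uniformly sample frames and encode each -> (num_samples, 512) features."""
    idx = np.linspace(0, len(video) - 1, num_samples).round().astype(int)
    feats = np.stack([encode_frame(video[i]) for i in idx])
    # L2-normalize, as CLIP embeddings are typically compared on the unit sphere.
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

video = np.zeros((120, 224, 224, 3), dtype=np.uint8)  # 120 dummy RGB frames
feats = clip_appearance_features(video)
print(feats.shape)  # (8, 512)
```

In the paper's setting, three different CLIP backbones would each produce such a feature sequence, which the X-Transformer then consumes alongside other modalities.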