1 code implementation • 18 Jul 2024 • Taian Guo, Taolin Zhang, Haoqian Wu, Hanjun Li, Ruizhi Qiao, Xing Sun
Conventional multi-label recognition methods often focus on label confidence, overlooking the pivotal role of partial-order relations consistent with human preference.
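The preference-aware idea can be pictured with a simple pairwise margin loss: for each partial-order constraint (i, j), the score of label i is pushed above that of label j. The snippet below is a minimal hypothetical sketch of such an objective, not the paper's actual formulation; the preference pairs and margin are assumptions.

```python
import torch
import torch.nn.functional as F

def partial_order_loss(logits: torch.Tensor, pref_pairs: list[tuple[int, int]],
                       margin: float = 1.0) -> torch.Tensor:
    """Hypothetical pairwise margin loss: for each preference pair (i, j),
    encourage the score of label i to exceed the score of label j by `margin`.

    logits: (num_labels,) per-label scores for one image.
    pref_pairs: partial-order constraints, e.g. (preferred, less-preferred) pairs.
    """
    losses = [F.relu(margin - (logits[i] - logits[j])) for i, j in pref_pairs]
    return torch.stack(losses).mean()

# Toy usage: labels 0 and 2 should outrank label 1.
scores = torch.tensor([2.0, 1.5, 0.3], requires_grad=True)
loss = partial_order_loss(scores, [(0, 1), (2, 1)])
loss.backward()
```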
1 code implementation • 29 Jun 2024 • Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang
Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options.
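The MCQ protocol itself is easy to pin down. Below is a schematic of such an evaluation loop; `model.score` is a hypothetical placeholder for whatever per-option scoring (e.g. option log-likelihood) a given LMM exposes, not a real API.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    image_path: str
    question: str
    options: list[str]   # e.g. ["A. cat", "B. dog", ...]
    answer: int          # index of the correct option

def evaluate_mcq(model, items: list[MCQItem]) -> float:
    """Schematic MCQ accuracy loop: score every option, pick the argmax,
    and compare against the gold index."""
    correct = 0
    for item in items:
        scores = [model.score(item.image_path, item.question, opt)
                  for opt in item.options]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == item.answer)
    return correct / len(items)
```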
no code implementations • 27 Aug 2023 • Xiujun Shu, Wei Wen, Liangsheng Xu, Ruizhi Qiao, Taian Guo, Hanjun Li, Bei Gan, Xiao Wang, Xing Sun
In this paper, we present a unified and dynamic graph (UniDG) framework for temporal character grouping.
1 code implementation • ICCV 2023 • Hanjun Li, Xiujun Shu, Sunan He, Ruizhi Qiao, Wei Wen, Taian Guo, Bei Gan, Xing Sun
Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA).
Ranked #10 on Temporal Sentence Grounding on Charades-STA
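The Gaussian-prior idea behind DGA can be made concrete: given a glance annotation (a single glanced timestamp inside the target moment), a Gaussian over clip positions weights each clip by its distance to the glance. The sketch below is an illustration under that reading; the prior width `sigma` stands in for the quantity such a module would adjust dynamically.

```python
import torch

def gaussian_prior(num_clips: int, glance_t: float, sigma: float) -> torch.Tensor:
    """Gaussian prior over clip positions: weights peak at the glanced
    timestamp and decay with temporal distance.

    glance_t: glance annotation as a normalized position in [0, 1].
    """
    centers = (torch.arange(num_clips, dtype=torch.float32) + 0.5) / num_clips
    weights = torch.exp(-0.5 * ((centers - glance_t) / sigma) ** 2)
    return weights / weights.sum()

# E.g. a glance at 30% of the video with a moderately wide prior:
prior = gaussian_prior(num_clips=16, glance_t=0.3, sigma=0.1)
```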
no code implementations • 19 Aug 2022 • Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Chen Wu, Xiujun Shu, Bo Ren
Image and language modeling is of crucial importance for vision-language pre-training (VLP), which aims to learn multi-modal representations from large-scale paired image-text data.
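How paired image-text data ties the two modalities together is commonly captured by a CLIP-style contrastive objective; the sketch below shows that generic formulation, and is not claimed to be this paper's specific loss.

```python
import torch
import torch.nn.functional as F

def contrastive_vlp_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style InfoNCE sketch: matched image-text pairs (the diagonal of
    the similarity matrix) are pulled together, all other pairings in the
    batch are pushed apart.

    img_emb, txt_emb: (batch, dim) embeddings for paired images and captions.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature         # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```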
no code implementations • 12 Aug 2022 • Xiujun Shu, Wei Wen, Taian Guo, Sunan He, Chen Wu, Ruizhi Qiao
This technical report presents the 3rd-place winning solution for MTVG, a new task introduced in the 4th Person in Context (PIC) Challenge at ACM MM 2022.
1 code implementation • 5 Jul 2022 • Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Bo Ren, Shu-Tao Xia
Specifically, our method exploits multi-modal knowledge of image-text pairs based on a vision-language pre-training (VLP) model.
Ranked #1 on Multi-label zero-shot learning on Open Images V4
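The zero-shot mechanism can be sketched as embedding-space matching: label names are encoded by the VLP text encoder and compared against the image embedding, so labels never seen during training can still be scored. The encoders and threshold below are assumptions for illustration, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

def zero_shot_multilabel(image_emb: torch.Tensor, label_embs: torch.Tensor,
                         threshold: float = 0.25) -> torch.Tensor:
    """Predict every label whose cosine similarity with the image embedding
    clears a threshold.

    image_emb: (dim,) image embedding from the VLP image encoder.
    label_embs: (num_labels, dim) text embeddings of the label names.
    Returns a boolean mask over labels.
    """
    sims = F.normalize(label_embs, dim=-1) @ F.normalize(image_emb, dim=-1)
    return sims > threshold
```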
1 code implementation • ECCV 2020 • Wenbo Li, Xin Tao, Taian Guo, Lu Qi, Jiangbo Lu, Jiaya Jia
Motivated by these findings, we propose a temporal multi-correspondence aggregation strategy to leverage similar patches across frames, and a cross-scale nonlocal-correspondence aggregation scheme to explore self-similarity of images across scales.
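A brute-force version of cross-frame patch correspondence makes the first strategy concrete: for each target-frame patch, find the top-k most similar patches in a neighboring frame and average them. This simplified sketch ignores the restricted search windows and learned fusion a real system would use.

```python
import torch
import torch.nn.functional as F

def topk_patch_aggregate(target: torch.Tensor, neighbor: torch.Tensor,
                         patch: int = 3, k: int = 4) -> torch.Tensor:
    """For every patch in `target`, average its k most similar patches in
    `neighbor`, then fold the aggregated patches back into a feature map.
    Brute-force (N x N similarities) for illustration only.

    target, neighbor: (C, H, W) feature maps of two frames.
    """
    t_raw = F.unfold(target.unsqueeze(0), patch, padding=patch // 2)[0]
    n_raw = F.unfold(neighbor.unsqueeze(0), patch, padding=patch // 2)[0]
    sim = F.normalize(t_raw, dim=0).t() @ F.normalize(n_raw, dim=0)  # (N, N)
    idx = sim.topk(k, dim=1).indices              # k best matches per patch
    agg = n_raw[:, idx].mean(dim=2)               # average matched patches
    h, w = target.shape[1:]
    out = F.fold(agg.unsqueeze(0), (h, w), patch, padding=patch // 2)
    norm = F.fold(torch.ones_like(agg).unsqueeze(0), (h, w), patch,
                  padding=patch // 2)             # overlap normalization
    return (out / norm)[0]
```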