no code implementations • 26 Mar 2025 • Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, Anton Van Den Hengel
Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements.
1 code implementation • 19 Mar 2025 • Wei Tang, Yanpeng Sun, Qinying Gu, Zechao Li
We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training.
1 code implementation • 11 Jan 2025 • Shan Zhang, Aotian Chen, Yanpeng Sun, Jindong Gu, Yi-Yu Zheng, Piotr Koniusz, Kai Zou, Anton Van Den Hengel, Yuan Xue
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding.
1 code implementation • 18 Dec 2024 • Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang
We propose to leverage off-the-shelf visual specialists, originally trained on annotated images for tasks other than image captioning, to enhance image captions.
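As a loose illustration of the idea (the function name and the rule-based merge format are my own; the paper's actual fusion of specialist outputs is more sophisticated), findings from visual specialists such as an object detector or an OCR model can be folded into a base caption:

```python
def compose_caption(base_caption, specialist_outputs):
    """Merge findings from visual specialists (e.g., detected objects,
    OCR text) into a base caption. The merge format here is purely
    illustrative, not the paper's method."""
    extras = "; ".join(
        f"{name}: {', '.join(items)}"
        for name, items in specialist_outputs.items() if items
    )
    return f"{base_caption} ({extras})" if extras else base_caption
```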
no code implementations • 22 Nov 2024 • Ke Zhu, Yu Wang, Yanpeng Sun, Qiang Chen, JiangJiang Liu, Gang Zhang, Jingdong Wang
Our nSFT disentangles this negative supervision from the RLHF paradigm and continually aligns VLMs with a simple SFT loss.
no code implementations • 17 Oct 2024 • Yanpeng Sun, Huaxin Zhang, Qiang Chen, Xinyu Zhang, Nong Sang, Gang Zhang, Jingdong Wang, Zechao Li
QLadder employs a learnable "ladder" structure to deeply aggregate the intermediate representations from the frozen pretrained visual encoder (e.g., the CLIP image encoder).
Ranked #165 on Visual Question Answering on MM-Vet
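A minimal numpy sketch of the aggregation idea (the softmax layer-mixing weights and the additive combination with the final layer are assumptions; the actual ladder is a learned module):

```python
import numpy as np

def ladder_aggregate(layer_feats, mix_logits):
    """Aggregate intermediate representations from a frozen encoder.
    layer_feats: list of (tokens, dim) arrays, one per encoder layer.
    mix_logits: learnable per-layer mixing logits.
    Sketch: softmax-weighted sum of layer outputs, added to the last one."""
    w = np.exp(mix_logits - mix_logits.max())
    w = w / w.sum()
    mixed = sum(wi * f for wi, f in zip(w, layer_feats))
    return layer_feats[-1] + mixed
```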
no code implementations • 20 Sep 2024 • Jing Hao, Yuxiang Zhao, Song Chen, Yanpeng Sun, Qiang Chen, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang
To this end, we devised the FullAnno system, a data engine that generates large-scale, high-quality, fine-grained image annotations consisting of object categories and positions, region descriptions, text information, and dense image captions.
no code implementations • 29 Aug 2024 • Peng Xing, Haofan Wang, Yanpeng Sun, Qixun Wang, Xu Bai, Hao Ai, Renyuan Huang, Zechao Li
Based on this pipeline, we construct IMAGStyle, the first large-scale style transfer dataset, containing 210k image triplets and available for the community to explore and research.
1 code implementation • CVPR 2024 • Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li
In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model.
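One common way to turn an annotated reference image into a prompt is masked average pooling followed by a similarity map; the sketch below illustrates that idea only, not the VRP encoder's actual architecture:

```python
import numpy as np

def reference_prompt(ref_feats, ref_mask, tgt_feats):
    """Derive a coarse prompt for a target image from an annotated
    reference. ref_feats/tgt_feats: (N, D) pixel features; ref_mask:
    (N,) binary annotation. Masked average pooling yields a foreground
    prototype; its similarity map over the target acts as the prompt."""
    denom = max(ref_mask.sum(), 1)
    proto = (ref_feats * ref_mask[:, None]).sum(axis=0) / denom
    return tgt_feats @ proto  # (N,) per-pixel prompt scores
```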
1 code implementation • 10 Apr 2023 • Yanpeng Sun, Qiang Chen, Jian Wang, Jingdong Wang, Zechao Li
By doing this, the model can leverage the diverse knowledge stored in different parts of the model to improve its performance on new tasks.
no code implementations • ICCV 2023 • Jinhao Du, Shan Zhang, Qiang Chen, Haifeng Le, Yanpeng Sun, Yao Ni, Jian Wang, Bin He, Jingdong Wang
To provide precise information for the query image, the prototype is decoupled into task-specific ones, which provide tailored guidance for 'where to look' and 'what to look for', respectively.
no code implementations • 26 Sep 2022 • Peng Xing, Yanpeng Sun, Zechao Li
In this paper, a novel Self-Supervised Guided Segmentation Framework (SGSF) is proposed that jointly explores an effective method for generating forged anomalous samples and uses normal-sample features as guidance information for the segmentation network in anomaly detection.
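A simple way to forge anomalous samples from normal images is patch cut-and-paste; the sketch below uses that CutPaste-style strategy purely for illustration, as the paper's actual generation method may differ:

```python
import numpy as np

def forge_anomaly(image, rng, patch=4):
    """Synthesize a forged anomalous sample from a normal image by
    cutting a random patch and pasting it at another location.
    image: (H, W) array; rng: numpy Generator."""
    h, w = image.shape
    out = image.copy()
    y1, x1 = rng.integers(0, h - patch), rng.integers(0, w - patch)
    y2, x2 = rng.integers(0, h - patch), rng.integers(0, w - patch)
    out[y2:y2 + patch, x2:x2 + patch] = image[y1:y1 + patch, x1:x1 + patch]
    return out
```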
1 code implementation • 13 Jun 2022 • Yanpeng Sun, Qiang Chen, Xiangyu He, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Jian Cheng, Zechao Li, Jingdong Wang
In this paper, we rethink the paradigm and explore a new regime: fine-tuning a small part of the parameters in the backbone.
Ranked #15 on Few-Shot Semantic Segmentation on COCO-20i (1-shot)
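Operationally, this regime amounts to freezing the backbone except for a chosen small subset of parameters; which subset to tune is the design choice being explored, and the patterns below are illustrative only:

```python
def select_tunable(param_names, patterns=("bias", "norm")):
    """Return the small subset of backbone parameter names to fine-tune,
    leaving the rest frozen. The patterns (biases, norm layers) are an
    illustrative choice, not the paper's exact recipe."""
    return [n for n in param_names if any(p in n for p in patterns)]
```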
no code implementations • 5 Nov 2021 • Yanpeng Sun, Zechao Li
Pixel-wise dense prediction tasks based on weak supervision currently use Class Activation Maps (CAM) to generate pseudo masks as ground truth.
Weakly-Supervised Object Localization, Weakly-Supervised Semantic Segmentation (+1)
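The standard CAM-to-pseudo-mask step can be sketched as a weighted sum of the final convolutional features followed by normalization and thresholding (the threshold value here is a typical but illustrative choice):

```python
import numpy as np

def cam_pseudo_mask(feature_map, class_weights, threshold=0.4):
    """Build a binary pseudo mask from a Class Activation Map.
    feature_map: (C, H, W) final conv features; class_weights: (C,)
    classifier weights for the target class."""
    cam = np.tensordot(class_weights, feature_map, axes=1)  # (H, W)
    cam = np.maximum(cam, 0.0)           # keep positive evidence
    cam = cam / (cam.max() + 1e-8)       # normalize to [0, 1]
    return (cam >= threshold).astype(np.uint8)
```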
1 code implementation • 20 Apr 2021 • Zechao Li, Yanpeng Sun, Jinhui Tang
Specifically, the Spatial Contextual Module (SCM) is leveraged to uncover the spatial contextual dependency between pixels by exploring the correlation between pixels and categories.
Ranked #76 on Semantic Segmentation on ADE20K val
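The pixel-category correlation idea can be sketched as soft assignment of pixel features to category embeddings, with pixels that share assignments exchanging context (the exact SCM formulation in the paper may differ):

```python
import numpy as np

def pixel_category_context(feats, class_centers):
    """feats: (N, D) pixel features; class_centers: (K, D) category
    embeddings. Softly assign pixels to categories, derive pixel-pixel
    contextual dependency from shared assignments, and refine features."""
    aff = feats @ class_centers.T                      # (N, K) affinities
    aff = np.exp(aff - aff.max(axis=1, keepdims=True))
    aff = aff / aff.sum(axis=1, keepdims=True)         # soft assignments
    ctx = aff @ aff.T                                  # (N, N) pixel context
    ctx = ctx / ctx.sum(axis=1, keepdims=True)         # row-normalize
    return ctx @ feats                                 # refined features
```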