Search Results for author: Yanpeng Sun

Found 15 papers, 7 papers with code

MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams

no code implementations · 26 Mar 2025 · Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, Anton van den Hengel

Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements.

Mathematical Reasoning · Object Counting

Visual Position Prompt for MLLM based Visual Grounding

1 code implementation · 19 Mar 2025 · Wei Tang, Yanpeng Sun, Qinying Gu, Zechao Li

We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training.

Position · Visual Grounding
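
The abstract does not spell out how the visual position prompt is built, so here is a minimal, assumption-laden sketch of one common way to inject explicit position information into patch features before the vision-to-LLM projector (a CoordConv-style coordinate map). The module name and design below are illustrative, not the paper's actual VPP.

```python
import torch
import torch.nn as nn

class CoordinatePrompt(nn.Module):
    """Append a normalized (x, y) coordinate map to visual features.

    A minimal CoordConv-style sketch of injecting explicit position
    information before the vision-to-LLM projector; the actual VPP
    module in the paper may differ.
    """
    def __init__(self, feat_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim + 2, out_dim)

    def forward(self, feats: torch.Tensor, grid_hw: tuple[int, int]) -> torch.Tensor:
        # feats: (B, H*W, C) patch features from the vision encoder
        b, n, _ = feats.shape
        h, w = grid_hw  # assumes n == h * w
        ys = torch.linspace(0, 1, h, device=feats.device)
        xs = torch.linspace(0, 1, w, device=feats.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([xx, yy], dim=-1).reshape(1, n, 2).expand(b, -1, -1)
        return self.proj(torch.cat([feats, coords], dim=-1))
```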

Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs

1 code implementation · 11 Jan 2025 · Shan Zhang, Aotian Chen, Yanpeng Sun, Jindong Gu, Yi-Yu Zheng, Piotr Koniusz, Kai Zou, Anton van den Hengel, Yuan Xue

Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding.

Math · Mathematical Problem-Solving +2

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

1 code implementation · 18 Dec 2024 · Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang

We propose to leverage off-the-shelf visual specialists, originally trained on annotated images for tasks other than image captioning, to enhance image captions.

Descriptive · Human-Object Interaction Detection +2
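
A hedged sketch of the enhancement loop the abstract describes: visual specialists contribute instance-level facts and a rewriter folds them into the caption. All callables below (detect_objects, recognize_text, rewrite) are hypothetical stand-ins for whatever specialists and rewriting model the real DCE engine uses.

```python
from typing import Callable

def enhance_caption(
    image,
    base_caption: str,
    detect_objects: Callable,   # image -> [(label, box), ...]
    recognize_text: Callable,   # image -> [str, ...]
    rewrite: Callable,          # (caption, facts) -> str, e.g. an LLM call
) -> str:
    # Collect instance-level facts from the off-the-shelf specialists.
    facts = [f"{label} at {tuple(box)}" for label, box in detect_objects(image)]
    text = recognize_text(image)
    if text:
        facts.append("visible text: " + ", ".join(text))
    # The rewriter keeps the caption fluent while absorbing the facts.
    return rewrite(base_caption, facts)
```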

Continual SFT Matches Multimodal RLHF with Negative Supervision

no code implementations · 22 Nov 2024 · Ke Zhu, Yu Wang, Yanpeng Sun, Qiang Chen, Jiang-Jiang Liu, Gang Zhang, Jingdong Wang

Our nSFT disentangles this negative supervision from the RLHF paradigm and continually aligns VLMs with a simple SFT loss.
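
Since nSFT reuses the ordinary SFT objective (its novelty lies in building targets from negative supervision rather than in the loss), a minimal next-token SFT loss looks like the sketch below; how the labels are constructed from negative responses is the part the paper contributes and is not shown here.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Plain next-token SFT loss; positions to ignore (e.g., the prompt)
    carry the label -100. This is the standard objective nSFT reuses,
    not anything specific to the paper."""
    # logits: (B, T, V); labels: (B, T)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predict token t+1 from t
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```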

Improving Multi-modal Large Language Model through Boosting Vision Capabilities

no code implementations · 17 Oct 2024 · Yanpeng Sun, Huaxin Zhang, Qiang Chen, Xinyu Zhang, Nong Sang, Gang Zhang, Jingdong Wang, Zechao Li

QLadder employs a learnable "ladder" structure to deeply aggregate the intermediate representations from the frozen pretrained visual encoder (e.g., the CLIP image encoder).

Decoder · Language Modeling +3
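
A minimal sketch of the ladder idea from the abstract: learnable queries that successively read intermediate layers of a frozen encoder. The layer selection, query count, and attention-based fusion below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Ladder(nn.Module):
    """Aggregate intermediate layers of a frozen encoder with learnable queries."""
    def __init__(self, dim: int, num_layers: int, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: per-layer (B, N, C) features from the frozen encoder
        b = hidden_states[0].size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        for attn, h in zip(self.blocks, hidden_states):
            out, _ = attn(q, h, h)   # queries read from each frozen layer
            q = q + out              # residual update, climbing the ladder
        return q                     # extra visual tokens for the LLM
```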

FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs

no code implementations · 20 Sep 2024 · Jing Hao, Yuxiang Zhao, Song Chen, Yanpeng Sun, Qiang Chen, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang

To this end, we devised FullAnno, a data engine that generates large-scale, high-quality, fine-grained image annotations comprising object categories and positions, region descriptions, text information, and dense image captions.

Image Captioning · Image Comprehension
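
The abstract enumerates the annotation types FullAnno emits; one illustrative record schema (field names are assumptions, not the released format) could look like this:

```python
from dataclasses import dataclass, field

@dataclass
class RegionAnnotation:
    category: str                            # object class
    box: tuple[float, float, float, float]   # x1, y1, x2, y2
    description: str = ""                    # region-level description

@dataclass
class FullAnnoRecord:
    """Hypothetical per-image record covering the annotation types the
    abstract lists; not FullAnno's actual released schema."""
    image_id: str
    regions: list[RegionAnnotation] = field(default_factory=list)
    ocr_text: list[str] = field(default_factory=list)  # text found in the image
    dense_caption: str = ""                  # long, fine-grained caption
```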

CSGO: Content-Style Composition in Text-to-Image Generation

no code implementations · 29 Aug 2024 · Peng Xing, Haofan Wang, Yanpeng Sun, Qixun Wang, Xu Bai, Hao Ai, Renyuan Huang, Zechao Li

Based on this pipeline, we construct IMAGStyle, the first large-scale style transfer dataset, containing 210k image triplets available for the community to explore and research.

Style Transfer · Text-to-Image Generation
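
Given that IMAGStyle consists of image triplets, a loader might look like the sketch below; the directory layout and file naming are purely illustrative assumptions about a content/style/stylized split, not the dataset's released format.

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class TripletStyleDataset(Dataset):
    """Loads (content, style, stylized) image triplets.

    Hypothetical layout: root/content/*.jpg, root/style/*.jpg,
    root/stylized/*.jpg, matched by file stem."""
    def __init__(self, root: str, transform=None):
        self.root = Path(root)
        self.ids = sorted(p.stem for p in (self.root / "content").glob("*.jpg"))
        self.transform = transform

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, i):
        name = self.ids[i] + ".jpg"
        triplet = tuple(
            Image.open(self.root / split / name).convert("RGB")
            for split in ("content", "style", "stylized")
        )
        return tuple(map(self.transform, triplet)) if self.transform else triplet
```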

VRP-SAM: SAM with Visual Reference Prompt

1 code implementation · CVPR 2024 · Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li

In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model.

Meta-Learning · Segmentation
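
A rough sketch of the interface the abstract implies: features of the annotated reference object are pooled into prompt embeddings that take the place SAM normally reserves for point/box prompts. The real VRP encoder is more elaborate; everything below is an illustrative simplification.

```python
import torch
import torch.nn as nn

class VRPEncoder(nn.Module):
    """Turn an annotated reference image into prompt embeddings for SAM."""
    def __init__(self, feat_dim: int, prompt_dim: int, num_prompts: int = 8):
        super().__init__()
        self.to_prompts = nn.Linear(feat_dim, prompt_dim * num_prompts)
        self.num_prompts = num_prompts
        self.prompt_dim = prompt_dim

    def forward(self, ref_feats: torch.Tensor, ref_mask: torch.Tensor) -> torch.Tensor:
        # ref_feats: (B, C, H, W) reference-image features
        # ref_mask:  (B, 1, H, W) binary annotation of the target object
        masked = (ref_feats * ref_mask).flatten(2).sum(-1)
        area = ref_mask.flatten(2).sum(-1).clamp(min=1.0)
        pooled = masked / area            # (B, C) mask-averaged object prototype
        # Project the prototype into tokens SAM's mask decoder can consume
        # in place of point/box prompt embeddings.
        return self.to_prompts(pooled).view(-1, self.num_prompts, self.prompt_dim)
```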

Exploring Effective Factors for Improving Visual In-Context Learning

1 code implementation · 10 Apr 2023 · Yanpeng Sun, Qiang Chen, Jian Wang, Jingdong Wang, Zechao Li

In this way, the model can leverage the diverse knowledge stored in its different parts to improve performance on new tasks.

In-Context Learning · Meta-Learning +1
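
Visual in-context learning of this kind is commonly cast as grid inpainting: an example input, its output, and the query are stitched into one canvas and the model fills in the missing quadrant. The sketch below shows that generic layout, not the paper's exact prompt format.

```python
import torch

def make_icl_canvas(example_in: torch.Tensor,
                    example_out: torch.Tensor,
                    query_in: torch.Tensor) -> torch.Tensor:
    """Stitch a 2x2 in-context canvas; the model inpaints the blank
    bottom-right quadrant as the query's prediction."""
    # all inputs: (C, H, W) tensors of the same size
    blank = torch.zeros_like(query_in)
    top = torch.cat([example_in, example_out], dim=2)   # side by side
    bottom = torch.cat([query_in, blank], dim=2)
    return torch.cat([top, bottom], dim=1)              # (C, 2H, 2W)
```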

σ-Adaptive Decoupled Prototype for Few-Shot Object Detection

no code implementations · ICCV 2023 · Jinhao Du, Shan Zhang, Qiang Chen, Haifeng Le, Yanpeng Sun, Yao Ni, Jian Wang, Bin He, Jingdong Wang

To provide precise information for the query image, the prototype is decoupled into task-specific ones, which provide tailored guidance for 'where to look' and 'what to look for', respectively.

Few-Shot Object Detection · Meta-Learning +3
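
Taking only the decoupling idea from the abstract, a minimal sketch might derive separate "where to look" and "what to look for" prototypes from mask-pooled support features via two independent projections; the paper's adaptive mechanism is more involved than this.

```python
import torch
import torch.nn as nn

class DecoupledPrototypes(nn.Module):
    """Build separate localization and classification prototypes from
    support features. An illustrative simplification of the decoupling
    idea, not the paper's σ-adaptive design."""
    def __init__(self, dim: int):
        super().__init__()
        self.where_proj = nn.Linear(dim, dim)  # guides 'where to look'
        self.what_proj = nn.Linear(dim, dim)   # guides 'what to look for'

    def forward(self, support_feats: torch.Tensor, support_mask: torch.Tensor):
        # support_feats: (B, C, H, W); support_mask: (B, 1, H, W)
        pooled = (support_feats * support_mask).flatten(2).sum(-1)
        pooled = pooled / support_mask.flatten(2).sum(-1).clamp(min=1.0)
        return self.where_proj(pooled), self.what_proj(pooled)
```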

Self-Supervised Guided Segmentation Framework for Unsupervised Anomaly Detection

no code implementations · 26 Sep 2022 · Peng Xing, Yanpeng Sun, Zechao Li

In this paper, a novel Self-Supervised Guided Segmentation Framework (SGSF) is proposed that jointly explores an effective method for generating forged anomalous samples and uses normal-sample features as guidance information for anomaly segmentation.

Segmentation · Unsupervised Anomaly Detection
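
One generic way to forge anomalous samples with pixel-accurate targets is cut-and-paste augmentation, sketched below; SGSF's actual generation method is its own design, so treat this purely as an illustration of how such a training signal is made.

```python
import random
import torch

def forge_anomaly(normal: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Create a synthetic anomalous sample and its segmentation target by
    pasting a patch of the image into a random other location
    (CutPaste-style). Illustrative only, not SGSF's generator."""
    c, h, w = normal.shape
    ph, pw = h // 8, w // 8                       # patch size: 1/8 of the image
    sy, sx = random.randrange(h - ph), random.randrange(w - pw)  # source
    dy, dx = random.randrange(h - ph), random.randrange(w - pw)  # destination
    forged = normal.clone()
    forged[:, dy:dy + ph, dx:dx + pw] = normal[:, sy:sy + ph, sx:sx + pw]
    mask = torch.zeros(1, h, w)
    mask[:, dy:dy + ph, dx:dx + pw] = 1.0         # ground truth for forged region
    return forged, mask
```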

SSA: Semantic Structure Aware Inference for Weakly Pixel-Wise Dense Predictions without Cost

no code implementations · 5 Nov 2021 · Yanpeng Sun, Zechao Li

Pixel-wise dense prediction tasks under weak supervision currently use Class Attention Maps (CAM) to generate pseudo masks as ground truth.

Weakly-Supervised Object Localization · Weakly-Supervised Semantic Segmentation +1
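
The generic CAM-to-pseudo-mask recipe the abstract refers to (not SSA itself, which refines such maps with semantic structure at inference time) is a weighted sum of the final feature maps followed by normalization and thresholding:

```python
import torch

def cam_pseudo_mask(feats: torch.Tensor, fc_weight: torch.Tensor,
                    cls: int, thresh: float = 0.4) -> torch.Tensor:
    """Standard CAM: weight the last conv feature maps by the classifier's
    weights for one class, normalize to [0, 1], and threshold into a
    binary pseudo mask."""
    # feats: (C, H, W) final conv features; fc_weight: (num_classes, C)
    cam = torch.einsum("c,chw->hw", fc_weight[cls], feats)
    cam = cam.clamp(min=0)                       # keep positive evidence only
    cam = cam / cam.max().clamp(min=1e-6)        # normalize to [0, 1]
    return (cam > thresh).float()                # binary pseudo mask
```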

CTNet: Context-based Tandem Network for Semantic Segmentation

1 code implementation · 20 Apr 2021 · Zechao Li, Yanpeng Sun, Jinhui Tang

Specifically, the Spatial Contextual Module (SCM) is leveraged to uncover the spatial contextual dependency between pixels by exploring the correlation between pixels and categories.

Segmentation · Semantic Segmentation
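
A sketch of the pixel-category correlation idea the abstract attributes to the SCM: pool per-class centers from a coarse prediction, then score every pixel against every center. The actual SCM formulation is more involved; this is only the skeleton.

```python
import torch
import torch.nn.functional as F

def pixel_category_affinity(feats: torch.Tensor,
                            coarse_logits: torch.Tensor) -> torch.Tensor:
    """Relate pixels through the categories they respond to; illustrative
    only, not the SCM's exact formulation."""
    # feats: (B, C, H, W); coarse_logits: (B, K, H, W) per-class scores
    b, c, h, w = feats.shape
    probs = F.softmax(coarse_logits.flatten(2), dim=-1)           # (B, K, HW)
    centers = torch.bmm(probs, feats.flatten(2).transpose(1, 2))  # (B, K, C)
    affinity = torch.bmm(centers, feats.flatten(2))               # (B, K, HW)
    return affinity.view(b, -1, h, w)   # per-pixel class-center correlation
```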
