Search Results for author: Sunan He

Found 12 papers, 8 papers with code

MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book

1 code implementation • 1 Jun 2025 • Sau Lai Yip, Sunan He, Yuxiang Nie, Shu Pui Chan, Yilin Ye, Sum Ying Lam, Hao Chen

Our findings highlight critical capability gaps in current GMAI systems while establishing textbook-derived multimodal benchmarks as essential evaluation tools.

Benchmarking

EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

no code implementations • 24 May 2025 • Guanghao Meng, Sunan He, Jinpeng Wang, Tao Dai, Letian Zhang, Jieming Zhu, Qing Li, Gang Wang, Rui Zhang, Yong Jiang

To address this problem, we propose the Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries.

Image-text Retrieval • Language Modeling +3
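
As a rough illustration of the query-enrichment idea described above (not the authors' implementation), the sketch below appends an LLM-generated visual description of a named entity to the text query before encoding it with a standard CLIP text encoder; the OpenAI clip package, the prompt format, and the describe_entity stub are all assumptions.

```python
# Hypothetical sketch of entity-visual-description query enrichment for CLIP
# retrieval; names, prompt format, and the describe_entity stub are illustrative.
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/32", device=device)

def enrich_query(query: str, entity: str, describe_entity) -> str:
    """Append an entity visual description (e.g. from an LLM) to the query."""
    return f"{query}. {entity}: {describe_entity(entity)}"

# In practice describe_entity would wrap an LLM call; a stub keeps this runnable.
enriched = enrich_query(
    "a photo of a shiba inu playing in the snow",
    "shiba inu",
    lambda e: "a small fox-like Japanese dog with a curled tail and cream markings",
)

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize([enriched]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
# text_feat can now be compared against normalized image features for retrieval.
```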

UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation

1 code implementation • 30 Apr 2025 • Linshan Wu, Yuxiang Nie, Sunan He, Jiaxin Zhuang, Hao Chen

UniBiomed is based on a novel integration of a Multi-modal Large Language Model (MLLM) and the Segment Anything Model (SAM), which effectively unifies the generation of clinical texts and the segmentation of corresponding biomedical objects for grounded interpretation.

Diagnostic • Large Language Model +3
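
Below is a minimal structural sketch, not the released UniBiomed code, of how an MLLM and a SAM-style mask decoder can be coupled for grounded interpretation: the language model emits report tokens plus a grounding hidden state that is projected into a prompt embedding for segmentation. Module interfaces, names, and dimensions are assumptions.

```python
# Structural sketch only: mllm and mask_decoder are placeholders for a multi-modal
# LLM and a SAM-style decoder; hidden_dim/prompt_dim values are illustrative.
import torch
import torch.nn as nn

class GroundedInterpreter(nn.Module):
    def __init__(self, mllm: nn.Module, mask_decoder: nn.Module,
                 hidden_dim: int = 4096, prompt_dim: int = 256):
        super().__init__()
        self.mllm = mllm                  # produces report logits + grounding state
        self.mask_decoder = mask_decoder  # turns image features + prompt into masks
        self.to_prompt = nn.Linear(hidden_dim, prompt_dim)

    def forward(self, image_feats: torch.Tensor, text_tokens: torch.Tensor):
        report_logits, grounding_hidden = self.mllm(image_feats, text_tokens)
        prompt = self.to_prompt(grounding_hidden)       # (B, prompt_dim)
        masks = self.mask_decoder(image_feats, prompt)  # (B, 1, H, W)
        return report_logits, masks                     # clinical text + grounded masks
```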

ConceptCLIP: Towards Trustworthy Medical AI via Concept-Enhanced Contrastive Language-Image Pre-training

1 code implementation • 26 Jan 2025 • Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Hao Chen

This dual alignment strategy enhances the model's capability to associate specific image regions with relevant concepts, thereby improving both the precision of analysis and the interpretability of the AI system.

Articles • Concept Alignment +1
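
A hedged sketch of what such a dual-alignment objective could look like: a global image-text contrastive term plus a regional patch-concept term. The loss form, weighting, and tensor shapes are assumptions, not ConceptCLIP's exact formulation.

```python
# Illustrative dual-alignment contrastive loss; the paper's exact formulation may
# differ (this is a generic symmetric InfoNCE over matched pairs).
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE where row i of `a` matches row i of `b`."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def dual_alignment_loss(img_emb, txt_emb, region_emb, concept_emb, alpha: float = 0.5):
    # img_emb, txt_emb: (B, D) global embeddings of paired images and texts.
    # region_emb, concept_emb: (N, D) matched region-concept pairs from the batch.
    return info_nce(img_emb, txt_emb) + alpha * info_nce(region_emb, concept_emb)
```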

Examining User-Friendly and Open-Sourced Large GPT Models: A Survey on Language, Multimodal, and Scientific GPT Models

1 code implementation • 27 Aug 2023 • Kaiyuan Gao, Sunan He, Zhenyu He, Jiacheng Lin, Qizhi Pei, Jie Shao, Wei Zhang

Generative pre-trained transformer (GPT) models have revolutionized the field of natural language processing (NLP) with remarkable performance across a wide range of tasks, and have also extended their power to multimodal domains.

Survey

D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation

1 code implementation • ICCV 2023 • Hanjun Li, Xiujun Shu, Sunan He, Ruizhi Qiao, Wei Wen, Taian Guo, Bei Gan, Xing Sun

Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA).

Contrastive Learning • Sentence +1
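
To make the Gaussian-prior idea concrete, the snippet below builds a Gaussian weighting over clip indices centered at the single glanced timestamp; this is a simplified illustration with assumed parameters, and it omits how D3G dynamically adjusts the prior.

```python
# Simplified Gaussian prior over video clips for a glance-annotated sentence;
# sigma and the downstream use of the weights are assumptions.
import torch

def gaussian_prior(num_clips: int, glance_idx: int, sigma: float = 3.0) -> torch.Tensor:
    """Gaussian weights over clip indices, peaked at the glanced clip."""
    t = torch.arange(num_clips, dtype=torch.float32)
    w = torch.exp(-0.5 * ((t - glance_idx) / sigma) ** 2)
    return w / w.sum()

weights = gaussian_prior(num_clips=32, glance_idx=10)
# `weights` could re-weight clip-sentence similarity scores or sample positive clips.
```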

Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding

no code implementations • 18 May 2023 • Taolin Zhang, Sunan He, Dai Tao, Bin Chen, Zhi Wang, Shu-Tao Xia

In recent years, vision-language pre-training frameworks have made significant progress in natural language processing and computer vision, achieving remarkable performance improvements on various downstream tasks.

Contrastive Learning • Object +2

VLMAE: Vision-Language Masked Autoencoder

no code implementations • 19 Aug 2022 • Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Chen Wu, Xiujun Shu, Bo Ren

Image and language modeling is of crucial importance for vision-language pre-training (VLP), which aims to learn multi-modal representations from large-scale paired image-text data.

Image-text Retrieval • Language Modeling +5

Exploiting Feature Diversity for Make-up Temporal Video Grounding

no code implementations • 12 Aug 2022 • Xiujun Shu, Wei Wen, Taian Guo, Sunan He, Chen Wu, Ruizhi Qiao

This technical report presents the 3rd-place winning solution for MTVG, a new task introduced in the 4th Person in Context (PIC) Challenge at ACM MM 2022.

Diversity • Video Grounding

Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer

1 code implementation • 5 Jul 2022 • Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Bo Ren, Shu-Tao Xia

Specifically, our method exploits multi-modal knowledge of image-text pairs based on a vision and language pre-training (VLP) model.

Image-text matching • Knowledge Distillation +8
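
For context, the sketch below shows the generic open-vocabulary multi-label recipe the abstract builds on: scoring an image against arbitrary label names in a shared vision-language embedding space, with OpenAI CLIP standing in for the VLP model; the label set, prompt template, and threshold are assumptions, and the paper's knowledge-distillation step is omitted.

```python
# Generic open-vocabulary multi-label scoring with CLIP as a stand-in VLP model;
# the label set, prompt template, and threshold are illustrative assumptions.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["dog", "frisbee", "grass", "car"]  # editable at test time (open vocabulary)
text = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    sims = (img_f @ txt_f.t()).squeeze(0)  # per-label cosine similarities

# A real multi-label system would calibrate or distil these scores; here a simple
# cosine-similarity threshold picks the predicted labels.
predicted = [label for label, s in zip(labels, sims.tolist()) if s > 0.25]
```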
