Search Results for author: Shitian Zhao

Found 11 papers, 6 papers with code

OmniCaptioner: One Captioner to Rule Them All

1 code implementation · 9 Apr 2025 · Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, Xiangchao Yan, Xin Li, Botian Shi, Tao Chen, Zhibo Chen, Lei Bai, Bo Zhang, Peng Gao

We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains.

All Image Captioning +2

LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis

no code implementations · 27 Mar 2025 · Shitian Zhao, Qilong Wu, Xinyue Li, Bo Zhang, Ming Li, Qi Qin, Dongyang Liu, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Peng Gao, Bin Fu, Zhen Li

We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity.

Image Generation Text Generation

Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning

1 code implementation · 20 Mar 2025 · Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, Haoquan Zhang, Wang Bill Zhu, Kaipeng Zhang

Experiments show that it converges to a specific prompt depending on model capability and task complexity, achieving comparable or better performance than both Thinking and No-Thinking-RL.

Classification Few-Shot Learning +4

Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models

no code implementations · 18 Mar 2025 · Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, Xiaofeng Yang

Vision-language models (VLMs) have advanced reasoning in natural scenes, but their role in medical imaging remains underexplored.

Anatomy Attribute +1

IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models

no code implementations · 23 Jan 2025 · Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li

Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation.

Depth Estimation Image Generation +3

Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models

1 code implementation · 1 Oct 2024 · Shitian Zhao, Renrui Zhang, Xu Luo, Yan Wang, Shanghang Zhang, Peng Gao

In this framework, people can propose new basic composition methods and combine them to get the new mixed composition methods.

Question Answering Visual Question Answering

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

1 code implementation · 23 Sep 2024 · Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Junlin Xie, Yu Qiao, Peng Gao, Hongsheng Li

Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning with human perceptual processes.

Image Restoration Text-to-Image Generation

Boosting Open-Domain Continual Learning via Leveraging Intra-domain Category-aware Prototype

no code implementations · 19 Aug 2024 · Yadong Lu, Shitian Zhao, Boxiang Yun, Dongsheng Jiang, Yin Li, Qingli Li, Yan Wang

Despite recent progress in enhancing the efficacy of Open-Domain Continual Learning (ODCL) in Vision-Language Models (VLM), failing to (1) correctly identify the Task-ID of a test image and (2) use only the category set corresponding to the Task-ID, while preserving the knowledge related to each domain, cannot address the two primary challenges of ODCL: forgetting old knowledge and maintaining zero-shot capabilities, as well as the confusions caused by category-relatedness between domains.

Continual Learning

Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

2 code implementations · 5 Aug 2024 · Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao

We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions.

Decoder Depth Estimation +3

Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models

no code implementations · CVPR 2024 · Shitian Zhao, Zhuowan Li, Yadong Lu, Alan Yuille, Yan Wang

We propose Causal Context Generation, Causal-CoG, which is a prompting strategy that engages contextual information to enhance precise VQA during inference.

Question Answering Visual Question Answering
