no code implementations • 2 Apr 2025 • Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, Hang Xu
We present ILLUME+, which leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation.
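As a rough illustration of the dual-tokenization idea (not ILLUME+'s actual implementation), the sketch below quantizes two feature streams against separate codebooks; the encoder outputs, codebook sizes, and names are all hypothetical.

```python
import torch

def quantize(features, codebook):
    """Nearest-neighbour lookup: return discrete token ids and their embeddings."""
    dists = torch.cdist(features, codebook)   # (N, K) pairwise L2 distances
    ids = dists.argmin(dim=-1)                # one token id per feature vector
    return ids, codebook[ids]

# Toy stand-ins for the outputs of two encoders over a 16x16 grid of patches.
semantic_feats = torch.randn(256, 768)        # e.g. from a semantic (CLIP-like) encoder
pixel_feats = torch.randn(256, 256)           # e.g. from a pixel/texture encoder

semantic_codebook = torch.randn(8192, 768)    # hypothetical codebook sizes
pixel_codebook = torch.randn(4096, 256)

sem_ids, _ = quantize(semantic_feats, semantic_codebook)
pix_ids, _ = quantize(pixel_feats, pixel_codebook)

# An LLM would read/emit both id streams; a diffusion decoder would then map
# the predicted tokens back to a high-fidelity image.
print(sem_ids.shape, pix_ids.shape)
```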
no code implementations • 9 Mar 2025 • Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hang Xu, Jianhua Han, Xiaodan Liang
We present SemHiTok, a unified image tokenizer built on a Semantic-Guided Hierarchical codebook that provides consistent discrete feature representations for multimodal understanding and generation tasks.
no code implementations • 5 Feb 2025 • Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Wei Zhang, Hang Xu, Li Zhang
Text-driven video generation has advanced significantly due to developments in diffusion models.
no code implementations • 6 Jan 2025 • Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Hang Xu, Li Zhang
However, training models for long video generation demands significant computational power and extensive data, leading most video diffusion models to be limited to a small number of frames.
no code implementations • 9 Dec 2024 • Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, Hang Xu
In this paper, we introduce ILLUME, a unified multimodal large language model (MLLM) that seamlessly integrates multimodal understanding and generation capabilities within a single large language model through a unified next-token prediction formulation.
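For a sense of what a unified next-token prediction formulation looks like in code, here is a heavily simplified sketch; the shared text/image vocabulary split, the sizes, and the omitted transformer body are illustrative assumptions, not ILLUME's configuration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; text and image tokens share one vocabulary so a single
# autoregressive objective covers both understanding and generation.
TEXT_VOCAB, IMAGE_VOCAB, DIM = 32000, 8192, 512
embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, DIM)
lm_head = nn.Linear(DIM, TEXT_VOCAB + IMAGE_VOCAB)

text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))                 # prompt tokens
image_ids = torch.randint(0, IMAGE_VOCAB, (1, 64)) + TEXT_VOCAB  # image tokens, offset into the shared vocab
sequence = torch.cat([text_ids, image_ids], dim=1)

hidden = embed(sequence)          # the decoder-only transformer body is omitted here
logits = lm_head(hidden)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),   # predict token t+1 from token t
    sequence[:, 1:].reshape(-1),
)
print(loss.item())
```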
Ranked #140 on Visual Question Answering on MM-Vet
no code implementations • 26 Sep 2024 • Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Jun Yao, Lanqing Hong, Lu Hou, Hang Xu
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models.
no code implementations • 6 Sep 2024 • Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu
Starting with a vision encoder pre-trained with image recognition tasks, UNIT introduces a lightweight language decoder for predicting text outputs and a lightweight vision decoder to prevent catastrophic forgetting of the original image encoding capabilities.
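A minimal sketch of that layout follows: a shared encoder feeds a lightweight text decoder for text prediction and a lightweight vision decoder that reconstructs the original visual features to guard against forgetting. The modules, sizes, and losses are placeholders chosen for illustration, not UNIT's implementation.

```python
import torch
import torch.nn as nn

dim, vocab = 512, 1000
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2
)
text_decoder = nn.Linear(dim, vocab)     # stands in for a small language decoder
vision_decoder = nn.Linear(dim, dim)     # stands in for a small vision decoder

patches = torch.randn(2, 196, dim)       # toy patch embeddings
with torch.no_grad():
    frozen_target = patches.clone()      # stand-in for the original (pre-trained) encoder's features

features = encoder(patches)
text_logits = text_decoder(features)     # supervised with text/OCR labels
recon = vision_decoder(features)

text_labels = torch.randint(0, vocab, (2, 196))
loss = (
    nn.functional.cross_entropy(text_logits.reshape(-1, vocab), text_labels.reshape(-1))
    + nn.functional.mse_loss(recon, frozen_target)  # keeps the encoder close to its original behaviour
)
print(loss.item())
```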
no code implementations • 11 Jul 2024 • Runhui Huang, Xinpeng Ding, Chunwei Wang, Jianhua Han, Yulong Liu, Hengshuang Zhao, Hang Xu, Lu Hou, Wei Zhang, Xiaodan Liang
High-resolution inputs enable Large Vision-Language Models (LVLMs) to discern finer visual details, enhancing their comprehension capabilities.
no code implementations • 28 Feb 2024 • Yulong Liu, Yunlong Yuan, Chunwei Wang, Jianhua Han, Yongqiang Ma, Li Zhang, Nanning Zheng, Hang Xu
In this work, we introduce a novel tool invocation pipeline designed to control massive real-world APIs.
1 code implementation • 6 Dec 2023 • Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, Li Zhang
Large vision-language models (VLMs) have garnered increasing interest in autonomous driving due to their advanced capabilities in the complex reasoning tasks essential for highly autonomous vehicle behavior.
no code implementations • 16 Oct 2023 • Kai Chen, Chunwei Wang, Kuo Yang, Jianhua Han, Lanqing Hong, Fei Mi, Hang Xu, Zhengying Liu, Wenyong Huang, Zhenguo Li, Dit-Yan Yeung, Lifeng Shang, Xin Jiang, Qun Liu
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
1 code implementation • ICCV 2023 • Ming Nie, Yujing Xue, Chunwei Wang, Chaoqiang Ye, Hang Xu, Xinge Zhu, Qingqiu Huang, Michael Bi Mi, Xinchao Wang, Li Zhang
Recently, polar-based representations have shown promising properties in perception tasks.
no code implementations • CVPR 2022 • Yihan Zeng, Da Zhang, Chunwei Wang, Zhenwei Miao, Ting Liu, Xin Zhan, Dayang Hao, Chao Ma
LiDAR and cameras are two sensors commonly used to collect sequential data for 3D object detection in the autonomous driving context.
no code implementations • NeurIPS 2021 • Yihan Zeng, Chunwei Wang, Yunbo Wang, Hang Xu, Chaoqiang Ye, Zhen Yang, Chao Ma
First, 3D-CoCo is inspired by our observation that bird's-eye-view (BEV) features are more transferable than low-level geometry features.
no code implementations • CVPR 2021 • Chunwei Wang, Chao Ma, Ming Zhu, Xiaokang Yang
On one hand, PointAugmenting decorates point clouds with corresponding point-wise CNN features extracted by pretrained 2D detection models, and then performs 3D object detection over the decorated point clouds.
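A rough sketch of that decoration step: each LiDAR point is projected into the image, the 2D CNN feature at that pixel is looked up, and the feature is appended to the point before 3D detection. The camera matrix and feature map below are dummies, not the paper's actual pipeline.

```python
import numpy as np

points = np.random.rand(1000, 4)                  # x, y, z, intensity
cnn_features = np.random.rand(64, 96, 160)        # C x H x W map from a 2D detector backbone
proj = np.array([[500.0, 0.0, 80.0, 0.0],         # dummy 3x4 camera projection matrix
                 [0.0, 500.0, 48.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])

xyz1 = np.concatenate([points[:, :3], np.ones((len(points), 1))], axis=1)
uvw = xyz1 @ proj.T                               # project points onto the image plane
uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
u = np.clip(uv[:, 0].astype(int), 0, cnn_features.shape[2] - 1)
v = np.clip(uv[:, 1].astype(int), 0, cnn_features.shape[1] - 1)

point_img_feats = cnn_features[:, v, u].T         # (N, C) per-point image features
decorated = np.concatenate([points, point_img_feats], axis=1)
print(decorated.shape)                            # (1000, 4 + 64): input to the 3D detector
```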