1 code implementation • 14 Apr 2025 • Haoran Hao, Jiaming Han, Yiyuan Zhang, Xiangyu Yue
Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment.
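As a rough illustration of such a compressor, the sketch below pools each segment's tokens into a fixed, smaller set of learnable query tokens via cross-attention. The class name, query count, and dimensions are hypothetical, not the paper's actual design.

import torch
import torch.nn as nn

class TemporalContextCompressor(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        # Learnable queries: each segment is summarized into num_queries tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, segment_tokens: torch.Tensor) -> torch.Tensor:
        # segment_tokens: (batch, seq_len, dim) tokens of one video segment.
        b = segment_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Cross-attention pools seq_len tokens into num_queries tokens.
        compressed, _ = self.attn(q, segment_tokens, segment_tokens)
        return compressed  # (batch, num_queries, dim)

# Usage: 256 tokens per segment -> 16 compressed tokens.
tokens = torch.randn(2, 256, 768)
print(TemporalContextCompressor()(tokens).shape)  # torch.Size([2, 16, 768])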
no code implementations • 23 Feb 2025 • Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, Jianlan Luo
Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and the ability to reactively choose appropriate motor skills.
1 code implementation • 3 Dec 2024 • Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities.
1 code implementation • 17 Oct 2024 • Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, Xiangyu Yue
To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs.
1 code implementation • CVPR 2024 • Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue
In detail, we first train an image projection module to connect a vision encoder with LLM.
Ranked #211 on Visual Question Answering on MM-Vet
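The image projection module mentioned in this entry can be pictured as a small network that maps frozen vision-encoder features into the LLM's token-embedding space, so image tokens can be prepended to text tokens. A minimal sketch, assuming a two-layer MLP and typical dimensions (both assumptions):

import torch
import torch.nn as nn

class ImageProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from a frozen encoder.
        return self.proj(vision_feats)  # (batch, num_patches, llm_dim)

image_tokens = ImageProjector()(torch.randn(1, 256, 1024))
# image_tokens can now be concatenated with text embeddings before the LLM.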
1 code implementation • 13 Nov 2023 • Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao
We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Ranked #2 on Visual Question Answering on BenchLMM (using extra training data)
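The "joint mixing of model weights" can be illustrated by parameter-wise interpolation of two checkpoints fine-tuned on different data domains. A minimal sketch; the 0.5 ratio and the checkpoint names in the usage comment are assumptions, not SPHINX's exact recipe.

import torch

def mix_weights(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * state_a + (1 - alpha) * state_b for matching tensors."""
    assert state_a.keys() == state_b.keys()
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k]
            for k in state_a}

# Usage with two hypothetical checkpoint files:
# mixed = mix_weights(torch.load("ckpt_real.pth"), torch.load("ckpt_synth.pth"))
# model.load_state_dict(mixed)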
2 code implementations • 7 Sep 2023 • Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao
During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder.
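A bind network of this kind can be sketched as a learnable projection that maps ImageBind image embeddings into the LLaMA embedding space; the stack of linear + norm blocks below is an assumption, not the paper's exact architecture.

import torch
import torch.nn as nn

class BindNetwork(nn.Module):
    def __init__(self, bind_dim: int = 1024, llama_dim: int = 4096, depth: int = 3):
        super().__init__()
        layers = [nn.Linear(bind_dim, llama_dim)]
        for _ in range(depth - 1):
            layers += [nn.LayerNorm(llama_dim), nn.SiLU(),
                       nn.Linear(llama_dim, llama_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, imagebind_embed: torch.Tensor) -> torch.Tensor:
        # imagebind_embed: (batch, bind_dim) global feature from ImageBind.
        return self.net(imagebind_embed)  # (batch, llama_dim)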
5 code implementations • 1 Sep 2023 • Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video.
Ranked #5 on 3D Question Answering (3D-QA) on 3D MM-Vet
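One common way to bind a new modality into a shared embedding space, shown below as an illustrative formulation rather than Point-Bind's exact recipe, is an InfoNCE-style contrastive loss between point-cloud features and paired (frozen) image or text features.

import torch
import torch.nn.functional as F

def contrastive_align(point_feats, anchor_feats, temperature=0.07):
    # point_feats, anchor_feats: (batch, dim); L2-normalized below.
    p = F.normalize(point_feats, dim=-1)
    a = F.normalize(anchor_feats, dim=-1)
    logits = p @ a.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(p.size(0), device=p.device)
    # Symmetric loss: points -> anchors and anchors -> points.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = contrastive_align(torch.randn(8, 512), torch.randn(8, 512))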
3 code implementations • 28 Apr 2023 • Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, Yu Qiao
This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following, and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.
Ranked #6 on Visual Question Answering (VQA) on InfiMM-Eval
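One way to picture such interference mitigation is to update disjoint parameter groups on the two data streams, so neither objective overwrites the other's weights. A minimal sketch; the group names and the selection rule are hypothetical.

import torch

def split_param_groups(model):
    visual, instruct = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Hypothetical naming convention for the two trainable subsets.
        (visual if "visual_proj" in name else instruct).append(p)
    return visual, instruct

# Each optimizer only ever touches its own subset of weights:
# opt_visual = torch.optim.AdamW(visual, lr=1e-4)     # image-text alignment
# opt_instruct = torch.optim.AdamW(instruct, lr=1e-4) # instruction following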
7 code implementations • 28 Mar 2023 • Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Yu Qiao
We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model.
Ranked #2 on Music Question Answering on MusicQA
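The lightweight-adaption idea can be sketched as zero-gated prompt injection: a learnable prompt's contribution is scaled by a gate initialized to zero, so fine-tuning starts exactly at the frozen pretrained model. Shapes and the residual placement below are assumptions.

import torch
import torch.nn as nn

class ZeroGatedPrompt(nn.Module):
    def __init__(self, dim: int = 4096, prompt_len: int = 10, num_heads: int = 32):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: no effect at step 0

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, dim) states of a frozen transformer layer.
        b = hidden.size(0)
        p = self.prompt.unsqueeze(0).expand(b, -1, -1)
        ctx, _ = self.attn(hidden, p, p)  # tokens attend to the learnable prompt
        return hidden + self.gate.tanh() * ctx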
1 code implementation • 31 Jan 2023 • Jiaming Han, Yuqiang Ren, Jian Ding, Ke Yan, Gui-Song Xia
As few-shot object detectors are often trained with abundant base samples and fine-tuned on few-shot novel examples, the learned models are usually biased to base classes and sensitive to the variance of novel examples.
1 code implementation • CVPR 2022 • Jiaming Han, Yuqiang Ren, Jian Ding, Xingjia Pan, Ke Yan, Gui-Song Xia
Thus, unknown objects in low-density regions can be easily identified with the learned unknown probability.
1 code implementation • 30 Aug 2021 • Gui-Song Xia, Jian Ding, Ming Qian, Nan Xue, Jiaming Han, Xiang Bai, Michael Ying Yang, Shengyang Li, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, Liangpei Zhang, Qiang Zhou, Chao-hui Yu, Kaixuan Hu, Yingjia Bu, Wenming Tan, Zhe Yang, Wei Li, Shang Liu, Jiaxuan Zhao, Tianzhi Ma, Zi-han Gao, Lingqi Wang, Yi Zuo, Licheng Jiao, Chang Meng, Hao Wang, Jiahao Wang, Yiming Hui, Zhuojun Dong, Jie Zhang, Qianyue Bao, Zixiao Zhang, Fang Liu
This report summarizes the results of the Learning to Understand Aerial Images (LUAI) 2021 challenge held at ICCV 2021, which focused on object detection and semantic segmentation in aerial images.
4 code implementations • CVPR 2021 • Jiaming Han, Jian Ding, Nan Xue, Gui-Song Xia
More precisely, we incorporate rotation-equivariant networks into the detector to extract rotation-equivariant features, which can accurately predict the orientation and lead to a large reduction in model size.
Ranked #21 on Object Detection In Aerial Images on DOTA (using extra training data)
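A toy example of rotation equivariance is a C4 (90-degree) lifting convolution: one kernel is applied in four rotated copies, producing an explicit orientation axis, so rotating the input permutes the orientation channels. The paper builds on general rotation-equivariant networks; the layer below only illustrates the idea.

import torch
import torch.nn as nn
import torch.nn.functional as F

class C4LiftingConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_ch, H, W); kernel rotated by 0/90/180/270 degrees.
        outs = [F.conv2d(x, torch.rot90(self.weight, r, dims=(2, 3)), padding=1)
                for r in range(4)]
        return torch.stack(outs, dim=2)  # (batch, out_ch, 4, H, W)

feat = C4LiftingConv(3, 8)(torch.randn(1, 3, 32, 32))
print(feat.shape)  # torch.Size([1, 8, 4, 32, 32])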
3 code implementations • 21 Aug 2020 • Jiaming Han, Jian Ding, Jie Li, Gui-Song Xia
However, most existing methods rely on heuristically defined anchors with different scales, angles, and aspect ratios, and usually suffer from severe misalignment between anchor boxes and axis-aligned convolutional features, which leads to a common inconsistency between the classification score and localization accuracy.
Ranked #26 on Object Detection In Aerial Images on DOTA (using extra training data)