Search Results for author: Jianhua Han

Found 39 papers, 7 papers with code

UNIT: Unifying Image and Text Recognition in One Vision Encoder

no code implementations · 6 Sep 2024 · Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu

Starting with a vision encoder pre-trained on image recognition tasks, UNIT introduces a lightweight language decoder for predicting text outputs and a lightweight vision decoder to prevent catastrophic forgetting of the original image encoding capabilities.

Decoder · Optical Character Recognition (OCR)
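
A minimal PyTorch-style sketch may make the described design concrete: one shared encoder feeding a small text decoder, plus a small vision decoder whose reconstruction loss anchors the encoder to its original features. Module names, sizes, and the anti-forgetting loss are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: a UNIT-style encoder with two lightweight decoders.
# All names, sizes, and the anti-forgetting loss are assumptions.
import torch.nn as nn
import torch.nn.functional as F

class UNITSketch(nn.Module):
    def __init__(self, dim=768, vocab_size=32000):
        super().__init__()
        # Shared vision encoder, pre-trained on image recognition tasks.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=12,
        )
        # Lightweight language decoder for text outputs (e.g., OCR).
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(dim, vocab_size)
        # Lightweight vision decoder: reconstructs the original recognition
        # features so the encoder does not catastrophically forget them.
        self.vision_decoder = nn.Linear(dim, dim)

    def forward(self, patch_tokens, text_emb, original_features):
        h = self.encoder(patch_tokens)                     # (B, N, D)
        logits = self.lm_head(self.text_decoder(text_emb, h))
        recon = self.vision_decoder(h)
        forget_loss = F.mse_loss(recon, original_features)
        return logits, forget_loss
```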

EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation

no code implementations · 23 Aug 2024 · Cong Wang, Jiaxi Gu, Panwen Hu, Haoyu Zhao, Yuanfan Guo, Jianhua Han, Hang Xu, Xiaodan Liang

Specifically, for the sketch-to-video generation task, EasyControl achieves improvements of 152.0 on FVD and 19.9 on IS, respectively, on UCF101 compared with VideoComposer.

Image Generation · Video Generation

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

no code implementations · 11 Jul 2024 · Runhui Huang, Xinpeng Ding, Chunwei Wang, Jianhua Han, Yulong Liu, Hengshuang Zhao, Hang Xu, Lu Hou, Wei Zhang, Xiaodan Liang

High-resolution inputs enable Large Vision-Language Models (LVLMs) to discern finer visual details, enhancing their comprehension capabilities.

Position

LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

no code implementations · 18 Mar 2024 · Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang, Renjing Pei, Guansong Lu, Songcen Xu, Wei Zhang, Hang Xu

Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct the specific-content generation for each layer.

Image Generation · Style Transfer
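
To make the two attention modules concrete, here is a hedged PyTorch sketch of one layer-collaborative block: inter-layer attention lets each spatial position exchange information across the composable layers, and intra-layer attention attends to a per-layer text prompt. All shapes and names are assumptions for illustration, not the paper's code.

```python
# Sketch only: layer-collaborative attention over composable image layers.
import torch.nn as nn

class LayerCollaborativeBlock(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, layer_feats, layer_prompts):
        # layer_feats: (B, L, N, D) -- L composable layers, N spatial tokens.
        # layer_prompts: (B, L, T, D) -- one text prompt embedding per layer.
        B, L, N, D = layer_feats.shape
        # Inter-layer attention: each spatial position attends across layers.
        x = layer_feats.permute(0, 2, 1, 3).reshape(B * N, L, D)
        x, _ = self.inter(x, x, x)
        x = x.reshape(B, N, L, D).permute(0, 2, 1, 3)
        # Text-guided intra-layer attention: each layer's tokens attend to
        # that layer's own prompt, steering layer-specific content.
        q = x.reshape(B * L, N, D)
        kv = layer_prompts.reshape(B * L, -1, D)
        out, _ = self.intra(q, kv, kv)
        return out.reshape(B, L, N, D)
```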

NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning

1 code implementation · 12 Mar 2024 · Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma, Jianhua Han, Hang Xu, Xiaojun Chang, Xiaodan Liang

Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.

Navigate · Vision and Language Navigation

PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion

no code implementations · 27 Dec 2023 · Guansong Lu, Yuanfan Guo, Jianhua Han, Minzhe Niu, Yihan Zeng, Songcen Xu, Zeyi Huang, Zhao Zhong, Wei Zhang, Hang Xu

Current large-scale diffusion models represent a giant leap forward in conditional image synthesis, capable of interpreting diverse cues like text, human poses, and edges.

Computational Efficiency · Denoising +1

G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

1 code implementation · 18 Dec 2023 · Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, YuFei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong

We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships.

Language Modelling · Large Language Model +1

Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving

1 code implementation · 6 Dec 2023 · Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, Li Zhang

Large vision-language models (VLMs) have garnered increasing interest in autonomous driving areas, due to their advanced capabilities in complex reasoning tasks essential for highly autonomous vehicle behavior.

Autonomous Driving · Decision Making

Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis

no code implementations · 16 Oct 2023 · Kai Chen, Chunwei Wang, Kuo Yang, Jianhua Han, Lanqing Hong, Fei Mi, Hang Xu, Zhengying Liu, Wenyong Huang, Zhenguo Li, Dit-yan Yeung, Lifeng Shang, Xin Jiang, Qun Liu

The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.

Instruction Following

Implicit Concept Removal of Diffusion Models

no code implementations · 9 Oct 2023 · Zhili Liu, Kai Chen, Yifan Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-yan Yeung, James Kwok

Subsequently, the model is optimized to identify and disentangle this information, which is then adopted as negative prompts during generation.
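
At generation time, the "negative prompt" step this abstract mentions can be pictured as ordinary classifier-free guidance with the recovered concept embedding on the negative side. The sketch below uses the Hugging Face diffusers UNet2DConditionModel calling convention; `neg_emb` stands in for the concept embedding the method identifies, and the identification/disentanglement step itself is not shown.

```python
# Sketch only: classifier-free guidance with a learned concept embedding as
# the negative prompt. `unet` is assumed to be a diffusers UNet2DConditionModel.
import torch

@torch.no_grad()
def guided_noise(unet, x_t, t, cond_emb, neg_emb, scale=7.5):
    """Predict noise, steering away from the implicit concept."""
    eps_cond = unet(x_t, t, encoder_hidden_states=cond_emb).sample
    eps_neg = unet(x_t, t, encoder_hidden_states=neg_emb).sample
    # Move toward the user prompt and away from the unwanted concept.
    return eps_neg + scale * (eps_cond - eps_neg)
```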

HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving

no code implementations · 11 Sep 2023 · Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li

For the first time, we leverage singular multimodal large language models (MLLMs) to consolidate multiple autonomous driving tasks from videos, i.e., the Risk Object Localization and Intention and Suggestion Prediction (ROLISP) task.

Autonomous Driving · Object Localization

Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images

no code implementations · 31 Aug 2023 · Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu

Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes.

Image Generation

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

no code implementations · ICCV 2023 · Runhui Huang, Jianhua Han, Guansong Lu, Xiaodan Liang, Yihan Zeng, Wei Zhang, Hang Xu

DiffDis first formulates the image-text discriminative problem as a generative diffusion process of the text embedding from the text encoder conditioned on the image.

Image Generation · Zero-Shot Learning
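
Read literally, that formulation amounts to denoising a noised text embedding under image conditioning. A minimal sketch under that reading follows; the denoiser architecture, timestep encoding, and noise schedule are all placeholder assumptions, not DiffDis itself.

```python
# Sketch only: image-text discrimination as diffusion over the text embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEmbDenoiser(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Inputs: noisy text embedding, image feature, and a scalar timestep.
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + 1, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, noisy_text_emb, image_feat, t):
        t = t.float().unsqueeze(-1) / 1000.0  # crude timestep encoding
        return self.net(torch.cat([noisy_text_emb, image_feat, t], dim=-1))

def diffusion_loss(denoiser, text_emb, image_feat, alphas_cumprod):
    """One epsilon-prediction training step on a batch of paired embeddings."""
    t = torch.randint(0, len(alphas_cumprod), (text_emb.size(0),))
    a = alphas_cumprod[t].unsqueeze(-1)
    noise = torch.randn_like(text_emb)
    noisy = a.sqrt() * text_emb + (1 - a).sqrt() * noise
    return F.mse_loss(denoiser(noisy, image_feat, t), noise)
```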

CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation

no code implementations · 17 Jun 2023 · Xiwen Liang, Liang Ma, Shanshan Guo, Jianhua Han, Hang Xu, Shikui Ma, Xiaodan Liang

Understanding and following natural language instructions while navigating through complex, real-world environments poses a significant challenge for general-purpose robots.

Decision Making · Instruction Following +4

RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment

1 code implementation · 31 May 2023 · Guian Fang, Zutao Jiang, Jianhua Han, Guansong Lu, Hang Xu, Shengcai Liao, Xiaodan Liang

Recent advances in text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from textual descriptions.

Caption Generation · Language Modelling +3

DetGPT: Detect What You Need via Reasoning

1 code implementation · 23 May 2023 · Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, Tong Zhang

Overall, our proposed paradigm and DetGPT demonstrate the potential for more sophisticated and intuitive interactions between humans and machines.

Autonomous Driving · Object +2

DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment

no code implementations · CVPR 2023 · Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Hang Xu

This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD).

Language Modelling · object-detection +1

Towards Universal Vision-language Omni-supervised Segmentation

no code implementations · 12 Mar 2023 · Bowen Dong, Jiaxi Gu, Jianhua Han, Hang Xu, WangMeng Zuo

To improve the open-world segmentation ability, we leverage omni-supervised data (i.e., panoptic segmentation data, object detection data, and image-text pairs) in training, thus enriching the open-world segmentation ability and achieving better segmentation accuracy.

Instance Segmentation · object-detection +4
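
As a rough illustration of what leveraging omni-supervised data in training could look like at the data-pipeline level, the sketch below interleaves the three supervision sources and tags each batch so the training loop can switch losses. Source names, sampling weights, and the tagging scheme are assumptions, not the paper's recipe.

```python
# Sketch only: mixing omni-supervised sources into one training stream.
import itertools
import random

def omni_batches(panoptic_ds, detection_ds, image_text_ds,
                 weights=(0.4, 0.3, 0.3)):
    """Yield (supervision_type, batch) pairs sampled across three sources."""
    names = ["panoptic", "detection", "image_text"]
    sources = [itertools.cycle(panoptic_ds),
               itertools.cycle(detection_ds),
               itertools.cycle(image_text_ds)]
    while True:
        i = random.choices(range(3), weights=weights)[0]
        # Tagging each batch lets the training loop pick the matching loss.
        yield names[i], next(sources[i])
```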

CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

no code implementations · CVPR 2023 · Yanxin Long, Youpeng Wen, Jianhua Han, Hang Xu, Pengzhen Ren, Wei Zhang, Shen Zhao, Xiaodan Liang

Besides, our CapDet also achieves state-of-the-art performance on dense captioning tasks, e.g., 15.44% mAP on VG V1.2 and 13.98% on the VG-COCO dataset.

Dense Captioning

Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving

no code implementations · CVPR 2023 · Xiwen Liang, Minzhe Niu, Jianhua Han, Hang Xu, Chunjing Xu, Xiaodan Liang

Multi-task learning has emerged as a powerful paradigm to solve a range of tasks simultaneously with good efficiency in both computation resources and inference time.

Autonomous Driving · Lane Detection +4

CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data

no code implementations · CVPR 2023 · Yihan Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chaoqiang Ye, Qingqiu Huang, Dit-yan Yeung, Zhen Yang, Xiaodan Liang, Hang Xu

Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks.

3D geometry

NLIP: Noise-robust Language-Image Pre-training

no code implementations · 14 Dec 2022 · Runhui Huang, Yanxin Long, Jianhua Han, Hang Xu, Xiwen Liang, Chunjing Xu, Xiaodan Liang

Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval and image captioning.

Image Captioning · Image-text Retrieval +3

Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

no code implementations · 2 Nov 2022 · Yanxin Long, Jianhua Han, Runhui Huang, Xu Hang, Yi Zhu, Chunjing Xu, Xiaodan Liang

Inspired by the success of vision-language methods (VLMs) in zero-shot classification, recent works attempt to extend this line of work into object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo labels for unseen classes in a self-training manner.

Object · object-detection +5
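
The self-training recipe this abstract summarizes typically hinges on one step: scoring class-agnostic proposals against text embeddings of the unseen classes and keeping the confident ones as pseudo boxes. A hedged sketch of that step follows, with made-up threshold, temperature, and shapes.

```python
# Sketch only: CLIP-style pseudo-labeling of region proposals for self-training.
import torch
import torch.nn.functional as F

def pseudo_label(region_feats, boxes, class_text_embs, thresh=0.8):
    # region_feats: (R, D) VLM features of R proposals; boxes: (R, 4);
    # class_text_embs: (C, D) text embeddings of unseen class names.
    region_feats = F.normalize(region_feats, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    sims = region_feats @ class_text_embs.T        # (R, C) cosine similarities
    probs = (100.0 * sims).softmax(dim=-1)         # CLIP-style temperature
    conf, labels = probs.max(dim=-1)
    keep = conf > thresh                           # keep confident proposals only
    return boxes[keep], labels[keep], conf[keep]
```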

DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection

no code implementations · 20 Sep 2022 · Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, Hang Xu

We further design a concept dictionary (with descriptions) from various online sources and detection datasets to provide prior knowledge for each concept.

object-detection · Open World Object Detection
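
As a toy illustration of how a description-enriched concept dictionary can feed the text side of such a detector, the snippet below attaches a definition to each class name before it would be encoded; the entries and the prompt template are made up, not the paper's dictionary.

```python
# Sketch only: enriching class names with dictionary descriptions.
concept_dict = {
    "sedan": "a passenger car with a separate trunk",
    "hydrant": "an upright street fixture supplying water for firefighting",
}

def enrich_concept(name: str) -> str:
    """Attach a short description so the text encoder sees prior knowledge."""
    desc = concept_dict.get(name)
    return f"{name}, which is {desc}" if desc else name

print(enrich_concept("sedan"))  # sedan, which is a passenger car with a separate trunk
```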

Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving

no code implementations · 19 Sep 2022 · Xiwen Liang, Yangxin Wu, Jianhua Han, Hang Xu, Chunjing Xu, Xiaodan Liang

Aiming towards a holistic understanding of multiple downstream tasks simultaneously, there is a need for extracting features with better transferability.

Autonomous Driving · Multi-Task Learning +4

Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding

no code implementations · 18 Jul 2022 · Quande Liu, Youpeng Wen, Jianhua Han, Chunjing Xu, Hang Xu, Xiaodan Liang

To bridge the gap between supervised semantic segmentation and real-world applications that require one model to recognize arbitrary new concepts, recent zero-shot segmentation has attracted much attention by exploring the relationships between unseen and seen object categories, yet it still requires large amounts of densely annotated data with diverse base classes.

Clustering · Online Clustering +3

Task-Customized Self-Supervised Pre-training with Scalable Dynamic Routing

no code implementations · 26 May 2022 · Zhili Liu, Jianhua Han, Lanqing Hong, Hang Xu, Kai Chen, Chunjing Xu, Zhenguo Li

On the other hand, for existing SSL methods, it is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.

Self-Supervised Learning

Laneformer: Object-aware Row-Column Transformers for Lane Detection

no code implementations · 18 Mar 2022 · Jianhua Han, Xiajun Deng, Xinyue Cai, Zhen Yang, Hang Xu, Chunjing Xu, Xiaodan Liang

We present Laneformer, a conceptually simple yet powerful transformer-based architecture tailored for lane detection, a long-standing research topic in visual perception for autonomous driving.

Autonomous Driving · Decoder +2

CODA: A Real-World Road Corner Case Dataset for Object Detection in Autonomous Driving

no code implementations · 15 Mar 2022 · Kaican Li, Kai Chen, Haoyu Wang, Lanqing Hong, Chaoqiang Ye, Jianhua Han, Yukuai Chen, Wei Zhang, Chunjing Xu, Dit-yan Yeung, Xiaodan Liang, Zhenguo Li, Hang Xu

One main reason that impedes the development of truly reliable self-driving systems is the lack of public datasets for evaluating the performance of object detectors on corner cases.

Autonomous Driving · Object +2

SODA10M: A Large-Scale 2D Self/Semi-Supervised Object Detection Dataset for Autonomous Driving

no code implementations · 21 Jun 2021 · Jianhua Han, Xiwen Liang, Hang Xu, Kai Chen, Lanqing Hong, Jiageng Mao, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Xiaodan Liang, Chunjing Xu

Experiments show that SODA10M can serve as a promising pre-training dataset for different self-supervised learning methods, which yield superior performance when fine-tuned on different downstream tasks (i.e., detection, semantic/instance segmentation) in the autonomous driving domain.

Autonomous Driving · Instance Segmentation +5
