Search Results for author: Shanghang Zhang

Found 123 papers, 55 papers with code

MAVIS: Mathematical Visual Instruction Tuning

1 code implementation 11 Jul 2024 Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Hongsheng Li

We identify three key areas within MLLMs that need to be improved: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills.

Contrastive Learning Language Modelling +3

Fisher-aware Quantization for DETR Detectors with Critical-category Objectives

no code implementations 3 Jul 2024 Huanrui Yang, Yafeng Huang, Zhen Dong, Denis A Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Yuan Du, Kurt Keutzer, Shanghang Zhang

We analyze the impact of quantization at the category-level granularity, and propose methods to improve performance for the critical categories.

object-detection Object Detection +1

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

no code implementations 22 Jun 2024 Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, Shanghang Zhang

In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception tasks, such as detection and segmentation.

Common Sense Reasoning Language Modelling +6

RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation

no code implementations 6 Jun 2024 Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Lily Lee, Kaichen Zhou, Pengju An, Senqiao Yang, Renrui Zhang, Yandong Guo, Shanghang Zhang

Inspired by this, we introduce RoboMamba, an end-to-end robotic MLLM that leverages the Mamba model to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference.

Common Sense Reasoning Pose Prediction +2

Implicit Neural Image Field for Biological Microscopy Image Compression

1 code implementation 29 May 2024 Gaole Dai, Cheng-Ching Tseng, Qingpo Wuwu, Rongyu Zhang, Shaokang Wang, Ming Lu, Tiejun Huang, Yu Zhou, Ali Ata Tuz, Matthias Gunzer, Jianxu Chen, Shanghang Zhang

The rapid pace of innovation in biological microscopy imaging has led to large images, putting pressure on data storage and impeding efficient sharing, management, and visualization.

Image Compression Management

Compositional Few-Shot Class-Incremental Learning

no code implementations 27 May 2024 Yixiong Zou, Shanghang Zhang, Haichen Zhou, Yuhua Li, Ruixuan Li

Few-shot class-incremental learning (FSCIL) is proposed to continually learn from novel classes with only a few samples after the (pre-)training on base classes with sufficient data.

Few-Shot Class-Incremental Learning Incremental Learning

Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation

1 code implementation 26 May 2024 Rongyu Zhang, Aosong Cheng, Yulin Luo, Gaole Dai, Huanrui Yang, Jiaming Liu, Ran Xu, Li Du, Yuan Du, Yanbing Jiang, Shanghang Zhang

Continual Test-Time Adaptation (CTTA), which aims to adapt the pre-trained model to ever-evolving target domains, emerges as an important task for vision models.

feature selection Test-time Adaptation

Unveiling the Tapestry of Consistency in Large Vision-Language Models

1 code implementation 23 May 2024 Yuan Zhang, Fei Xiao, Tao Huang, Chun-Kai Fan, Hongyuan Dong, Jiawen Li, Jiacong Wang, Kuan Cheng, Shanghang Zhang, Haoyuan Guo

To this end, we provide a multi-modal benchmark, ConBench, to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point.

Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

no code implementations 19 May 2024 Peng Li, YuAn Liu, Xiaoxiao Long, Feihu Zhang, Cheng Lin, Mengfei Li, Xingqun Qi, Shanghang Zhang, Wenhan Luo, Ping Tan, Wenping Wang, Qifeng Liu, Yike Guo

Specifically, these methods assume that the input images should comply with a predefined camera type, e.g., a perspective camera with a fixed focal length, leading to distorted shapes when the assumption fails.

Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning

no code implementations 13 Apr 2024 Yijiang Liu, Rongyu Zhang, Huanrui Yang, Kurt Keutzer, Yuan Du, Li Du, Shanghang Zhang

Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications, ranging from content generation to interactive entertainment, and artistic creation.


Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding

7 code implementations 11 Apr 2024 Yiwen Tang, Ray Zhang, Jiaming Liu, Zoey Guo, Dong Wang, Zhigang Wang, Bin Zhao, Shanghang Zhang, Peng Gao, Hongsheng Li, Xuelong Li

The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, compelling the semantic adaption of any-modality transformers.

SpikeNVS: Enhancing Novel View Synthesis from Blurry Images via Spike Camera

no code implementations 10 Apr 2024 Gaole Dai, Zhenyu Wang, Qinwen Xu, Ming Lu, Wen Chen, Boxin Shi, Shanghang Zhang, Tiejun Huang

Since the spike camera relies on temporal integration instead of temporal differentiation used by event cameras, our proposed TfS loss maintains manageable training costs.

Novel View Synthesis

Jump Self-attention: Capturing High-order Statistics in Transformers

no code implementations journal 2024 Haoyi Zhou, Siyang Xiao, Shanghang Zhang, Jieqi Peng, Shuai Zhang, JianXin Li

However, the strong assumption that elements are directly attentive to each other limits the performance of tasks with high-order dependencies such as natural language understanding and image captioning.

Image Captioning Natural Language Understanding

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

1 code implementation 29 Mar 2024 Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Hongsheng Li

In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.

Instruction Following Language Modelling +5

Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection

no code implementations 22 Mar 2024 Hongzhi Gao, Zheng Chen, Zehui Chen, Lin Chen, Jiaming Liu, Shanghang Zhang, Feng Zhao

Training high-accuracy 3D detectors necessitates massive labeled 3D annotations with 7 degrees of freedom, which is laborious and time-consuming.

3D Object Detection object-detection +2

DOZE: A Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments

no code implementations 29 Feb 2024 Ji Ma, Hongming Dai, Yao Mu, Pengying Wu, Hao Wang, Xiaowei Chi, Yang Fei, Shanghang Zhang, Chang Liu

Zero-Shot Object Navigation (ZSON) requires agents to autonomously locate and approach unseen objects in unfamiliar environments and has emerged as a particularly challenging task within the domain of Embodied AI.

Attribute Collision Avoidance +3

A Vanilla Multi-Task Framework for Dense Visual Prediction Solution to 1st VCL Challenge -- Multi-Task Robustness Track

no code implementations 27 Feb 2024 Zehui Chen, Qiuchen Wang, Zhenyu Li, Jiaming Liu, Shanghang Zhang, Feng Zhao

In this report, we present our solution to the multi-task robustness track of the 1st Visual Continual Learning (VCL) Challenge at ICCV 2023 Workshop.

3D Object Detection Continual Learning +5

Building Flexible Machine Learning Models for Scientific Computing at Scale

no code implementations 25 Feb 2024 Tianyu Chen, Haoyi Zhou, Ying Li, Hao Wang, Chonghan Gao, Shanghang Zhang, JianXin Li

Foundation models have revolutionized knowledge acquisition across domains, and our study introduces OmniArch, a paradigm-shifting approach designed for building foundation models in multi-physics scientific computing.

Zero-Shot Learning

Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis

1 code implementation 31 Jan 2024 Jianing Li, Xi Nan, Ming Lu, Li Du, Shanghang Zhang

To overcome this limitation in MLLMs, we introduce Proximity Question Answering (Proximity QA), a novel framework designed to enable MLLMs to infer the proximity relationship between objects in images.

Multi-Task Learning Question Answering +1

RustNeRF: Robust Neural Radiance Field with Low-Quality Images

no code implementations 6 Jan 2024 Mengfei Li, Ming Lu, Xiaofang Li, Shanghang Zhang

First, existing methods assume enough high-quality images are available for training the NeRF model, ignoring real-world image degradation.

Novel View Synthesis

VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model

no code implementations 5 Jan 2024 Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, Chang Liu

In the realm of household robotics, the Zero-Shot Object Navigation (ZSON) task empowers agents to adeptly traverse unfamiliar environments and locate objects from novel categories without prior explicit training.

Language Modelling Large Language Model

A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models

1 code implementation 4 Jan 2024 Rui Ma, Qiang Zhou, Yizhu Jin, Daquan Zhou, Bangjun Xiao, Xiuyu Li, Yi Qu, Aishani Singh, Kurt Keutzer, Jingtong Hu, Xiaodong Xie, Zhen Dong, Shanghang Zhang, Shiji Zhou

Notably, models like Stable Diffusion, which excel in text-to-image synthesis, heighten the risk of copyright infringement and unauthorized distribution. Machine unlearning, which seeks to eradicate the influence of specific data or concepts from machine learning models, emerges as a promising solution by eliminating the "copyright memories" ingrained in diffusion models.

Text-to-Image Generation

PromptCoT: Align Prompt Distribution via Adapted Chain-of-Thought

no code implementations CVPR 2024 Junyi Yao, Yijiang Liu, Zhen Dong, Mingfei Guo, Helan Hu, Kurt Keutzer, Li Du, Daquan Zhou, Shanghang Zhang

Considering computational efficiency, instead of allocating a dedicated LLM for prompt enhancement to each individual model or dataset, we integrate adapters that facilitate dataset-specific adaptation, leveraging a shared pre-trained LLM as the foundation for this process.

Computational Efficiency Prompt Engineering +1

Cloud-Device Collaborative Learning for Multimodal Large Language Models

no code implementations CVPR 2024 Guanqun Wang, Jiaming Liu, Chenxuan Li, Junpeng Ma, Yuan Zhang, Xinyu Wei, Kevin Zhang, Maurice Chong, Ray Zhang, Yijiang Liu, Shanghang Zhang

However, the deployment of these large-scale MLLMs on client devices is hindered by their extensive model parameters, leading to a notable decline in generalization capabilities when these models are compressed for device deployment.

Device-Cloud Collaboration Knowledge Distillation +1

Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training

no code implementations 23 Dec 2023 Xinyan Chen, Jiaxin Ge, Tianjun Zhang, Jiaming Liu, Shanghang Zhang

IPR first samples a batch of images conditioned on the text, then relabels the text prompts of unmatched text-image pairs with classifier feedback.

Image Generation reinforcement-learning +2

FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection

no code implementations 22 Dec 2023 Dongmei Zhang, Chang Li, Ray Zhang, Shenghao Xie, Wei Xue, Xiaodong Xie, Shanghang Zhang

In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of the 3D model by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary detection without facing constraints from the original 3D datasets.

3D Object Detection 3D Open-Vocabulary Object Detection +2

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

no code implementations 21 Dec 2023 Senqiao Yang, Jiaming Liu, Ray Zhang, Mingjie Pan, Zoey Guo, Xiaoqi Li, Zehui Chen, Peng Gao, Yandong Guo, Shanghang Zhang

In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs to gain a comprehensive understanding of outdoor 3D scenes.

Instruction Following Language Modelling +1

Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation

no code implementations CVPR 2024 Jiaming Liu, Ran Xu, Senqiao Yang, Renrui Zhang, Qizhe Zhang, Zehui Chen, Yandong Guo, Shanghang Zhang

To tackle these issues, we propose a continual self-supervised method, Adaptive Distribution Masked Autoencoders (ADMA), which enhances the extraction of target domain knowledge while mitigating the accumulation of distribution shifts.

Decoder Self-Supervised Learning +1

Customize-It-3D: High-Quality 3D Creation from A Single Image Using Subject-Specific Knowledge Prior

no code implementations 15 Dec 2023 Nan Huang, Ting Zhang, Yuhui Yuan, Dong Chen, Shanghang Zhang

In this paper, we present a novel two-stage approach that fully utilizes the information provided by the reference image to establish a customized knowledge prior for image-to-3D generation.

3D Generation Image to 3D +1

Gradient-based Parameter Selection for Efficient Fine-Tuning

1 code implementation CVPR 2024 Zhi Zhang, Qizhe Zhang, Zijun Gao, Renrui Zhang, Ekaterina Shutova, Shiji Zhou, Shanghang Zhang

With the growing size of pre-trained models, full fine-tuning and storing all the parameters for various downstream tasks is costly and infeasible.

Image Classification Image Segmentation +2

Split-Ensemble: Efficient OOD-aware Ensemble via Task and Model Splitting

no code implementations 14 Dec 2023 Anthony Chen, Huanrui Yang, Yulu Gan, Denis A Gudovskiy, Zhen Dong, Haofan Wang, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang

In particular, we build a tree-like Split-Ensemble architecture by performing iterative splitting and pruning from a shared backbone model, where each branch serves as a submodel corresponding to a subtask.

MoSA: Mixture of Sparse Adapters for Visual Efficient Tuning

1 code implementation 5 Dec 2023 Qizhe Zhang, Bocheng Zou, Ruichuan An, Jiaming Liu, Shanghang Zhang

Motivated by this, we propose Mixture of Sparse Adapters, or MoSA, as a novel Adapter Tuning method to fully unleash the potential of each parameter in the adapter.

MoEC: Mixture of Experts Implicit Neural Compression

no code implementations 3 Dec 2023 Jianchen Zhao, Cheng-Ching Tseng, Ming Lu, Ruichuan An, Xiaobao Wei, He Sun, Shanghang Zhang

However, manually designing the partition scheme for a complex scene is very challenging and fails to jointly learn the partition and INRs.

Data Compression

M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation

1 code implementation 29 Nov 2023 Xiaowei Chi, Rongyu Zhang, Zhengkai Jiang, Yijiang Liu, Yatian Wang, Xingqun Qi, Wenhan Luo, Peng Gao, Shanghang Zhang, Qifeng Liu, Yike Guo

Moreover, to further enhance the effectiveness of $M^{3}Adapter$ while preserving the coherence of semantic context comprehension, we introduce a two-stage $M^{3}FT$ fine-tuning strategy.

Image Generation Language Modelling +1

Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

no code implementations CVPR 2024 Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, Yike Guo

In addition, the lack of large-scale available datasets with emotional transition speech and corresponding 3D human gestures also limits the addressing of this task.

Audio inpainting Gesture Generation

COLE: A Hierarchical Generation Framework for Multi-Layered and Editable Graphic Design

no code implementations 28 Nov 2023 Peidong Jia, Chenxuan Li, Yuhui Yuan, Zeyu Liu, Yichao Shen, Bohan Chen, Xingru Chen, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, Shanghang Zhang, Baining Guo

Our COLE system comprises multiple fine-tuned Large Language Models (LLMs), Large Multimodal Models (LMMs), and Diffusion Models (DMs), each specifically tailored for design-aware layer-wise captioning, layout planning, reasoning, and the task of generating images and text.

Image Generation

I-MedSAM: Implicit Medical Image Segmentation with Segment Anything

1 code implementation 28 Nov 2023 Xiaobao Wei, Jiajun Cao, Yizhu Jin, Ming Lu, Guangyu Wang, Shanghang Zhang

To convert the SAM features and coordinates into continuous segmentation output, we utilize Implicit Neural Representation (INR) to learn an implicit segmentation decoder.

Decoder Image Segmentation +3

FreeKD: Knowledge Distillation via Semantic Frequency Prompt

1 code implementation CVPR 2024 Yuan Zhang, Tao Huang, Jiaming Liu, Tao Jiang, Kuan Cheng, Shanghang Zhang

(2) During the distillation period, a pixel-wise frequency mask is generated via Frequency Prompt to localize the pixels of interest (PoIs) in various frequency bands.

Knowledge Distillation

Heterogenous Memory Augmented Neural Networks

1 code implementation 17 Oct 2023 Zihan Qiu, Zhen Liu, Shuicheng Yan, Shanghang Zhang, Jie Fu

It has been shown that semi-parametric methods, which combine standard neural networks with non-parametric components such as external memory modules and data retrieval, are particularly helpful in data scarcity and out-of-distribution (OOD) scenarios.


PAD: A Dataset and Benchmark for Pose-agnostic Anomaly Detection

1 code implementation NeurIPS 2023 Qiang Zhou, Weize Li, Lihan Jiang, Guoliang Wang, Guyue Zhou, Shanghang Zhang, Hao Zhao

Furthermore, we provide an open-source benchmark library, including dataset and baseline methods that cover 8 anomaly detection paradigms, to facilitate future research and application in this domain.

4k Anomaly Detection

Distribution-Aware Continual Test-Time Adaptation for Semantic Segmentation

no code implementations 24 Sep 2023 Jiayi Ni, Senqiao Yang, Ran Xu, Jiaming Liu, Xiaoqi Li, Wenyu Jiao, Zehui Chen, Yi Liu, Shanghang Zhang

In this paper, we propose a distribution-aware tuning (DAT) method to make the semantic segmentation CTTA efficient and practical in real-world applications.

Autonomous Driving Semantic Segmentation +1

RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision

1 code implementation 18 Sep 2023 Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Bing Wang, Hongwei Xie, Li Liu, Shanghang Zhang

3D occupancy prediction holds significant promise in the fields of robot perception and autonomous driving, which quantifies 3D scenes into grid cells with semantic labels.

Autonomous Driving

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

1 code implementation 14 Aug 2023 Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu

Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost.

Text Generation

PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers

no code implementations 1 Jul 2023 Peidong Jia, Jiaming Liu, Senqiao Yang, Jiarui Wu, Xiaodong Xie, Shanghang Zhang

PDM comprehensively leverages the prompt memory to extract domain-specific knowledge and explicitly constructs a long-term memory space for the data distribution, which represents better domain diversity compared to existing methods.

Diversity object-detection +1

DiffuseIR: Diffusion Models For Isotropic Reconstruction of 3D Microscopic Images

no code implementations 21 Jun 2023 Mingjie Pan, Yulu Gan, Fangxu Zhou, Jiaming Liu, Aimin Wang, Shanghang Zhang, Dawei Li

Since the diffusion model learns the universal structural distribution of biological tissues, which is independent of the axial resolution, DiffuseIR can reconstruct authentic images with unseen low-axial resolutions into a high-axial resolution without requiring re-training.


UniOcc: Unifying Vision-Centric 3D Occupancy Prediction with Geometric and Semantic Rendering

no code implementations 15 Jun 2023 Mingjie Pan, Li Liu, Jiaming Liu, Peixiang Huang, Longlong Wang, Shanghang Zhang, Shaoqing Xu, Zhiyi Lai, Kuiyuan Yang

In this technical report, we present our solution, named UniOCC, for the Vision-Centric 3D occupancy prediction track in the nuScenes Open Dataset Challenge at CVPR 2023.

Prediction Of Occupancy Grid Maps

ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation

2 code implementations 7 Jun 2023 Jiaming Liu, Senqiao Yang, Peidong Jia, Renrui Zhang, Ming Lu, Yandong Guo, Wei Xue, Shanghang Zhang

Note that our method can be regarded as a novel transfer paradigm for large-scale models, delivering promising results in adaptation to continually changing distributions.

Test-time Adaptation

HUB: Guiding Learned Optimizers with Continuous Prompt Tuning

no code implementations 26 May 2023 Gaole Dai, Wei Wu, Ziyu Wang, Jie Fu, Shanghang Zhang, Tiejun Huang

By incorporating hand-designed optimizers as the second component in our hybrid approach, we are able to retain the benefits of learned optimizers while stabilizing the training process and, more importantly, improving testing performance.


Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models

no code implementations 21 May 2023 Yijia Zhang, Lingran Zhao, Shijie Cao, WenQiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu

In this study, we conduct a comparative analysis of INT and FP quantization with the same bit-width, revealing that the optimal quantization format varies across different layers due to the complexity and diversity of tensor distribution.


Chain of Thought Prompt Tuning in Vision Language Models

no code implementations 16 Apr 2023 Jiaxin Ge, Hongyin Luo, Siyuan Qian, Yulu Gan, Jie Fu, Shanghang Zhang

Chain of Thought is a simple and effective approximation to human reasoning process and has been proven useful for natural language processing (NLP) tasks.

Domain Generalization Image Classification +4

Open-Vocabulary Point-Cloud Object Detection without 3D Annotation

1 code implementation CVPR 2023 Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang

In this paper, we address open-vocabulary 3D point-cloud detection by a dividing-and-conquering strategy, which involves: 1) developing a point-cloud detector that can learn a general representation for localizing various objects, and 2) connecting textual and point-cloud representations to enable the detector to classify novel object categories based on text prompting.

3D Object Detection 3D Open-Vocabulary Object Detection +3

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

1 code implementation CVPR 2023 Anthony Chen, Kevin Zhang, Renrui Zhang, Zihan Wang, Yuheng Lu, Yandong Guo, Shanghang Zhang

Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities, yet very few works have addressed their capabilities in multi-modality settings.

3D Object Detection Decoder +3

MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID

1 code implementation CVPR 2023 Jianyang Gu, Kai Wang, Hao Luo, Chen Chen, Wei Jiang, Yuqiang Fang, Shanghang Zhang, Yang You, Jian Zhao

Neural Architecture Search (NAS) has become increasingly appealing to the object Re-Identification (ReID) community, because task-specific architectures significantly improve the retrieval performance.

Image Classification Neural Architecture Search +3

Q-Diffusion: Quantizing Diffusion Models

1 code implementation ICCV 2023 Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, Kurt Keutzer

We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture of the diffusion models, which compresses the noise estimation network to accelerate the generation process.

Image Generation Noise Estimation +1

Improving Generalization of Meta-Learning With Inverted Regularization at Inner-Level

no code implementations CVPR 2023 Lianzhe Wang, Shiji Zhou, Shanghang Zhang, Xu Chu, Heng Chang, Wenwu Zhu

Despite the broad interest in meta-learning, the generalization problem remains one of the significant challenges in this field.


CSQ: Growing Mixed-Precision Quantization Scheme with Bi-level Continuous Sparsification

no code implementations 6 Dec 2022 Lirui Xiao, Huanrui Yang, Zhen Dong, Kurt Keutzer, Li Du, Shanghang Zhang

CSQ stabilizes the bit-level mixed-precision training process with a bi-level gradual continuous sparsification on both the bit values of the quantized weights and the bit selection in determining the quantization precision of each layer.


BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks

no code implementations CVPR 2023 Xiaowei Chi, Jiaming Liu, Ming Lu, Rongyu Zhang, Zhaoqing Wang, Yandong Guo, Shanghang Zhang

In order to find them, we further propose a LiDAR-guided sampling strategy to leverage the statistical distribution of LiDAR to determine the heights of local slices.

3D Object Detection Autonomous Driving +1

Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-world

no code implementations CVPR 2023 Yulu Gan, Mingjie Pan, Rongyu Zhang, Zijian Ling, Lingran Zhao, Jiaming Liu, Shanghang Zhang

To enable the device model to deal with changing environments, we propose a new learning paradigm of Cloud-Device Collaborative Continual Adaptation, which encourages collaboration between cloud and device and improves the generalization of the device model.

Device-Cloud Collaboration object-detection +2

BEVUDA: Multi-geometric Space Alignments for Domain Adaptive BEV 3D Object Detection

no code implementations 30 Nov 2022 Jiaming Liu, Rongyu Zhang, Xiaoqi Li, Xiaowei Chi, Zehui Chen, Ming Lu, Yandong Guo, Shanghang Zhang

In this paper, we propose a Multi-space Alignment Teacher-Student (MATS) framework to ease the domain shift accumulation, which consists of a Depth-Aware Teacher (DAT) and a Geometric-space Aligned Student (GAS) model.

3D Object Detection Autonomous Driving +4

NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers

no code implementations CVPR 2023 Yijiang Liu, Huanrui Yang, Zhen Dong, Kurt Keutzer, Li Du, Shanghang Zhang

Building on the theoretical insight, NoisyQuant achieves the first success on actively altering the heavy-tailed activation distribution with additive noisy bias to fit a given quantizer.


PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

2 code implementations ICCV 2023 Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, Peng Gao

In this paper, we first collaborate CLIP and GPT to be a unified 3D open-world learner, named as PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection.

3D Classification 3D Object Detection +11

Margin-Based Few-Shot Class-Incremental Learning with Class-Level Overfitting Mitigation

1 code implementation 10 Oct 2022 Yixiong Zou, Shanghang Zhang, Yuhua Li, Ruixuan Li

Few-shot class-incremental learning (FSCIL) is designed to incrementally recognize novel classes with only few training samples after the (pre-)training on base classes with sufficient samples, which focuses on both base-class performance and novel-class generalization.

Few-Shot Class-Incremental Learning Incremental Learning

Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

1 code implementation 27 Sep 2022 Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, Xianglong Liu

With the trends of large NLP models, the increasing memory and computation costs hinder their efficient deployment on resource-limited devices.


Uncertainty Guided Depth Fusion for Spike Camera

no code implementations 26 Aug 2022 Jianing Li, Jiaming Liu, Xiaobao Wei, Jiyuan Zhang, Ming Lu, Lei Ma, Li Du, Tiejun Huang, Shanghang Zhang

In this paper, we propose a novel Uncertainty-Guided Depth Fusion (UGDF) framework to fuse the predictions of monocular and stereo depth estimation networks for spike camera.

Autonomous Driving Stereo Depth Estimation

Unsupervised Spike Depth Estimation via Cross-modality Cross-domain Knowledge Transfer

1 code implementation 26 Aug 2022 Jiaming Liu, Qizhe Zhang, Jianing Li, Ming Lu, Tiejun Huang, Shanghang Zhang

Neuromorphic spike data, an upcoming modality with high temporal resolution, has shown promising potential in real-world applications due to its inherent advantage in overcoming high-velocity motion blur.

Autonomous Driving Depth Estimation +2

Efficient Meta-Tuning for Content-aware Neural Video Delivery

1 code implementation 20 Jul 2022 Xiaoqi Li, Jiaming Liu, Shizun Wang, Cheng Lyu, Ming Lu, Yurong Chen, Anbang Yao, Yandong Guo, Shanghang Zhang

Our method significantly reduces the computational cost and achieves even better performance, paving the way for applying neural video delivery techniques to practical applications.


Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

no code implementations 5 Jul 2022 Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang

Current point-cloud detection methods have difficulty detecting the open-vocabulary objects in the real world, due to their limited generalization capability.

Cloud Detection Contrastive Learning

MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer

1 code implementation 3 May 2022 Jinze Yu, Jiaming Liu, Xiaobao Wei, Haoyi Zhou, Yohei Nakata, Denis Gudovskiy, Tomoyuki Okuno, JianXin Li, Kurt Keutzer, Shanghang Zhang

To solve this problem, we propose an end-to-end cross-domain detection Transformer based on the mean teacher framework, MTTrans, which can fully exploit unlabeled target domain data in object detection training and transfer knowledge between domains via pseudo labels.

Domain Adaptation Object +3

Temporal Efficient Training of Spiking Neural Network via Gradient Re-weighting

1 code implementation ICLR 2022 Shikuang Deng, Yuhang Li, Shanghang Zhang, Shi Gu

Then we introduce the temporal efficient training (TET) approach to compensate for the loss of momentum in the gradient descent with SG so that the training process can converge into flatter minima with better generalizability.

Biphasic Face Photo-Sketch Synthesis via Semantic-Driven Generative Adversarial Network with Graph Representation Learning

no code implementations 5 Jan 2022 Xingqun Qi, Muyi Sun, Zijian Wang, Jiaming Liu, Qi Li, Fang Zhao, Shanghang Zhang, Caifeng Shan

To keep the generated faces structurally coordinated, the IRSG models inter-class structural relations among every facial component by graph representation learning.

Generative Adversarial Network Graph Representation Learning +1

Differentiable Spike: Rethinking Gradient-Descent for Training Spiking Neural Networks

no code implementations NeurIPS 2021 Yuhang Li, Yufei Guo, Shanghang Zhang, Shikuang Deng, Yongqing Hai, Shi Gu

Based on the introduced finite difference gradient, we propose a new family of Differentiable Spike (Dspike) functions that can adaptively evolve during training to find the optimal shape and smoothness for gradient estimation.

Event data classification Image Classification

2nd Place Solution for VisDA 2021 Challenge -- Universally Domain Adaptive Image Recognition

no code implementations 27 Oct 2021 Haojin Liao, Xiaolin Song, Sicheng Zhao, Shanghang Zhang, Xiangyu Yue, Xingxu Yao, Yueming Zhang, Tengfei Xing, Pengfei Xu, Qiang Wang

The Visual Domain Adaptation (VisDA) 2021 Challenge calls for unsupervised domain adaptation (UDA) methods that can deal with both input distribution shift and label set variance between the source and target domains.

Universal Domain Adaptation Unsupervised Domain Adaptation

Meta Learning with Minimax Regularization

no code implementations 29 Sep 2021 Lianzhe Wang, Shiji Zhou, Shanghang Zhang, Wenpeng Zhang, Heng Chang, Wenwu Zhu

Even though meta-learning has attracted wide research attention in recent years, the generalization problem of meta-learning is still not well addressed.

Few-Shot Learning

Unsupervised Domain Adaptive 3D Detection with Multi-Level Consistency

1 code implementation ICCV 2021 Zhipeng Luo, Zhongang Cai, Changqing Zhou, Gongjie Zhang, Haiyu Zhao, Shuai Yi, Shijian Lu, Hongsheng Li, Shanghang Zhang, Ziwei Liu

In addition, existing 3D domain adaptive detection methods often assume prior access to the target domain annotations, which is rarely feasible in the real world.

3D Object Detection Autonomous Driving +1

Delving Deep into the Generalization of Vision Transformers under Distribution Shifts

1 code implementation CVPR 2022 Chongzhi Zhang, Mingyuan Zhang, Shanghang Zhang, Daisheng Jin, Qiang Zhou, Zhongang Cai, Haiyu Zhao, Xianglong Liu, Ziwei Liu

By comprehensively investigating these GE-ViTs and comparing with their corresponding CNN models, we observe: 1) For the enhanced model, larger ViTs still benefit more for the OOD generalization.

Out-of-Distribution Generalization Self-Supervised Learning

Online Continual Adaptation with Active Self-Training

no code implementations 11 Jun 2021 Shiji Zhou, Han Zhao, Shanghang Zhang, Lianzhe Wang, Heng Chang, Zhi Wang, Wenwu Zhu

Our theoretical results show that OSAMD can adapt quickly to changing environments with active queries.

Self-Supervised Pretraining Improves Self-Supervised Pretraining

1 code implementation 23 Mar 2021 Colorado J. Reed, Xiangyu Yue, Ani Nrusimha, Sayna Ebrahimi, Vivek Vijaykumar, Richard Mao, Bo Li, Shanghang Zhang, Devin Guillory, Sean Metzger, Kurt Keutzer, Trevor Darrell

Through experimentation on 16 diverse vision datasets, we show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.

Image Augmentation

P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding

no code implementations 24 Dec 2020 Yunze Liu, Li Yi, Shanghang Zhang, Qingnan Fan, Thomas Funkhouser, Hao Dong

Self-supervised representation learning is a critical problem in computer vision, as it provides a way to pretrain feature extractors on large unlabeled datasets that can be used as an initialization for more efficient and effective training on downstream tasks.

Contrastive Learning Representation Learning +1

Annotation-Efficient Untrimmed Video Action Recognition

no code implementations 30 Nov 2020 Yixiong Zou, Shanghang Zhang, Guangyao Chen, Yonghong Tian, Kurt Keutzer, José M. F. Moura

In this paper, we target a new problem, Annotation-Efficient Video Recognition, to reduce the annotation requirements for both the large number of samples and the action locations.

Action Recognition Contrastive Learning +3

Cross-Domain Sentiment Classification with Contrastive Learning and Mutual Information Maximization

1 code implementation 30 Oct 2020 Tian Li, Xiang Chen, Shanghang Zhang, Zhen Dong, Kurt Keutzer

Due to the scarcity of labels in the target domain, we introduce mutual information maximization (MIM) in addition to CL to exploit the features that best support the final prediction.

Contrastive Learning General Classification +3

A Review of Single-Source Deep Unsupervised Visual Domain Adaptation

1 code implementation 1 Sep 2020 Sicheng Zhao, Xiangyu Yue, Shanghang Zhang, Bo Li, Han Zhao, Bichen Wu, Ravi Krishna, Joseph E. Gonzalez, Alberto L. Sangiovanni-Vincentelli, Sanjit A. Seshia, Kurt Keutzer

To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled or unlabeled target domain.

Unsupervised Domain Adaptation

Revisiting Mid-Level Patterns for Cross-Domain Few-Shot Recognition

no code implementations 7 Aug 2020 Yixiong Zou, Shanghang Zhang, JianPeng Yu, Yonghong Tian, José M. F. Moura

To solve this problem, cross-domain FSL (CDFSL) was proposed very recently to transfer knowledge from general-domain base classes to special-domain novel classes.

cross-domain few-shot learning

TCGM: An Information-Theoretic Framework for Semi-Supervised Multi-Modality Learning

no code implementations ECCV 2020 Xinwei Sun, Yilun Xu, Peng Cao, Yuqing Kong, Lingjing Hu, Shanghang Zhang, Yizhou Wang

In this paper, we propose a novel information-theoretic approach, namely Total Correlation Gain Maximization (TCGM), for semi-supervised multi-modal learning, which is endowed with promising properties: (i) it can effectively utilize the information across different modalities of unlabeled data points to facilitate training classifiers of each modality; (ii) it has a theoretical guarantee to identify Bayesian classifiers, i.e., the ground truth posteriors of all modalities.

Disease Prediction Emotion Recognition +1

Rethinking Distributional Matching Based Domain Adaptation

no code implementations 23 Jun 2020 Bo Li, Yezhen Wang, Tong Che, Shanghang Zhang, Sicheng Zhao, Pengfei Xu, Wei Zhou, Yoshua Bengio, Kurt Keutzer

In this paper, in order to devise robust DA algorithms, we first systematically analyze the limitations of DM based methods, and then build new benchmarks with more realistic domain shifts to evaluate the well-accepted DM methods.

Domain Adaptation

Compositional Few-Shot Recognition with Primitive Discovery and Enhancing

no code implementations 12 May 2020 Yixiong Zou, Shanghang Zhang, Ke Chen, Yonghong Tian, Yao-Wei Wang, José M. F. Moura

Inspired by such capability of humans, to imitate humans' ability of learning visual primitives and composing primitives to recognize novel classes, we propose an approach to FSL to learn a feature representation composed of important primitives, which is jointly trained with two parts, i.e., primitive discovery and primitive enhancing.

Few-Shot Image Classification Few-Shot Learning +1

Decoupling Global and Local Representations via Invertible Generative Flows

1 code implementation ICLR 2021 Xuezhe Ma, Xiang Kong, Shanghang Zhang, Eduard Hovy

In this work, we propose a new generative model that is capable of automatically decoupling global and local representations of images in an entirely unsupervised setting, by embedding a generative flow in the VAE framework to model the decoder.

Decoder Density Estimation +3

COVID-CT-Dataset: A CT Scan Dataset about COVID-19

18 code implementations 30 Mar 2020 Xingyi Yang, Xuehai He, Jinyu Zhao, Yichen Zhang, Shanghang Zhang, Pengtao Xie

Using this dataset, we develop diagnosis methods based on multi-task learning and self-supervised learning that achieve an F1 of 0.90, an AUC of 0.98, and an accuracy of 0.89.

Computed Tomography (CT) COVID-19 Diagnosis +2

Decoupling Features and Coordinates for Few-shot RGB Relocalization

no code implementations 26 Nov 2019 Siyan Dong, Songyin Wu, Yixin Zhuang, Kai Xu, Shanghang Zhang, Baoquan Chen

To address this issue, we approach camera relocalization with a decoupled solution where feature extraction, coordinate regression, and pose estimation are performed separately.

Camera Relocalization Pose Estimation +1

Multi-source Distilling Domain Adaptation

1 code implementation 22 Nov 2019 Sicheng Zhao, Guangzhi Wang, Shanghang Zhang, Yang Gu, Yaxian Li, Zhichao Song, Pengfei Xu, Runbo Hu, Hua Chai, Kurt Keutzer

Deep neural networks suffer from performance decay when there is domain shift between the labeled source domain and unlabeled target domain, which motivates the research on domain adaptation (DA).

Domain Adaptation Multi-Source Unsupervised Domain Adaptation

Generalized Zero-shot ICD Coding

no code implementations 28 Sep 2019 Congzheng Song, Shanghang Zhang, Najmeh Sadoughi, Pengtao Xie, Eric Xing

The International Classification of Diseases (ICD) is a list of classification codes for the diagnoses.

General Classification Generalized Zero-Shot Learning +3

Dual Adversarial Semantics-Consistent Network for Generalized Zero-Shot Learning

no code implementations NeurIPS 2019 Jian Ni, Shanghang Zhang, Haiyong Xie

In particular, the primal GAN learns to synthesize inter-class discriminative and semantics-preserving visual features from both the semantic representations of seen/unseen classes and the ones reconstructed by the dual GAN.

Generalized Zero-Shot Learning Transfer Learning

MaCow: Masked Convolutional Generative Flow

2 code implementations NeurIPS 2019 Xuezhe Ma, Xiang Kong, Shanghang Zhang, Eduard Hovy

Flow-based generative models, conceptually attractive due to tractability of both the exact log-likelihood computation and latent-variable inference, and efficiency of both training and sampling, have led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations.

Computational Efficiency Density Estimation +1

Adversarial Multiple Source Domain Adaptation

no code implementations NeurIPS 2018 Han Zhao, Shanghang Zhang, Guanhang Wu, José M. F. Moura, Joao P. Costeira, Geoffrey J. Gordon

In this paper we propose new generalization bounds and algorithms under both classification and regression settings for unsupervised multiple source domain adaptation.

Classification Domain Adaptation +5

Modeling relation paths for knowledge base completion via joint adversarial training

1 code implementation 14 Oct 2018 Chen Li, Xutan Peng, Shanghang Zhang, Hao Peng, Philip S. Yu, Min He, Linfeng Du, Lihong Wang

By treating relations and multi-hop paths as two different input sources, we use a feature extractor, which is shared by two downstream components (i.e., relation classifier and source discriminator), to capture shared/similar information between them.

Knowledge Base Completion Relation

Learning to Understand Image Blur

no code implementations CVPR 2018 Shanghang Zhang, Xiaohui Shen, Zhe Lin, Radomír Měch, João P. Costeira, José M. F. Moura

In this paper, we propose a unified framework to estimate a spatially-varying blur map and understand its desirability in terms of image quality at the same time.

Multiple Source Domain Adaptation with Adversarial Learning

no code implementations ICLR 2018 Han Zhao, Shanghang Zhang, Guanhang Wu, João P. Costeira, José M. F. Moura, Geoffrey J. Gordon

We propose a new generalization bound for domain adaptation when there are multiple source domains with labeled instances and one target domain with unlabeled instances.

Domain Adaptation Sentiment Analysis

Topology Adaptive Graph Convolutional Networks

2 code implementations ICLR 2018 Jian Du, Shanghang Zhang, Guanhang Wu, Jose M. F. Moura, Soummya Kar

Spectral graph convolutional neural networks (CNNs) require approximation to the convolution to alleviate the computational complexity, resulting in performance loss.

FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras

1 code implementation ICCV 2017 Shanghang Zhang, Guanhang Wu, João P. Costeira, José M. F. Moura

To overcome limitations of existing methods and incorporate the temporal information of traffic video, we design a novel FCN-rLSTM network to jointly estimate vehicle density and vehicle count by connecting fully convolutional neural networks (FCN) with long short term memory networks (LSTM) in a residual learning fashion.

Multiple Source Domain Adaptation with Adversarial Training of Neural Networks

4 code implementations 26 May 2017 Han Zhao, Shanghang Zhang, Guanhang Wu, João P. Costeira, José M. F. Moura, Geoffrey J. Gordon

As a step toward bridging the gap, we propose a new generalization bound for domain adaptation when there are multiple source domains with labeled instances and one target domain with unlabeled instances.

Domain Adaptation Sentiment Analysis

Understanding Traffic Density from Large-Scale Web Camera Data

1 code implementation CVPR 2017 Shanghang Zhang, Guanhang Wu, João P. Costeira, José M. F. Moura

Understanding traffic density from large-scale web camera (webcam) videos is a challenging problem because such videos have low spatial and temporal resolution, high occlusion and large perspective.

