no code implementations • ECCV 2020 • Zhihang Yuan, Bingzhe Wu, Guangyu Sun, Zheng Liang, Shiwan Zhao, Weichen Bi
To this end, based on a given CNN model, we first generate a CNN architecture space in which each architecture is a multi-stage CNN derived from the given model via predefined transformations.
no code implementations • 18 Dec 2024 • Zhihang Yuan, Yuzhang Shang, Hanling Zhang, Tongcheng Fang, Rui Xie, Bingxin Xu, Yan Yan, Shengen Yan, Guohao Dai, Yu Wang
Our approach not only enhances computational efficiency but also aligns naturally with image generation principles by operating in continuous token space and following a hierarchical generation process from coarse to fine details.
1 code implementation • 3 Dec 2024 • Zhaofeng Hu, Sifan Zhou, Shibo Zhao, Zhihang Yuan
3D single object tracking is essential in autonomous driving and robotics.
no code implementations • 26 Nov 2024 • Rui Xie, Tianchen Zhao, Zhihang Yuan, Rui Wan, Wenxi Gao, Zhenhua Zhu, Xuefei Ning, Yu Wang
Visual Autoregressive (VAR) has emerged as a promising approach in image generation, offering performance competitive with diffusion-based models.
1 code implementation • 16 Sep 2024 • Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang
Therefore, we introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression: (1) We first analyze the singular value distribution of the KV cache, revealing significant redundancy and compression potential along the channel dimension.
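As a rough illustration (not the CSKV implementation), the sketch below exposes channel-dimension redundancy in a KV cache with an SVD and keeps only the dominant directions; the tensor layout and the energy threshold are assumptions for the example.

```python
import torch

def shrink_kv_channels(kv: torch.Tensor, energy: float = 0.95):
    """Keep only the channel directions that explain `energy` of the spectral
    energy of a KV cache laid out as [num_tokens, num_channels] (sketch)."""
    U, S, Vh = torch.linalg.svd(kv, full_matrices=False)
    cum = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
    rank = int((cum < energy).sum().item()) + 1   # smallest rank reaching the threshold
    proj = Vh[:rank].T                            # [num_channels, rank]
    compressed = kv @ proj                        # [num_tokens, rank] stored instead
    return compressed, proj                       # approx. original: compressed @ proj.T

# kv = torch.randn(1024, 128); small, proj = shrink_kv_channels(kv)
```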
1 code implementation • 12 Jul 2024 • Chuanhao Sun, Zhihang Yuan, Kai Xu, Luo Mai, N. Siddharth, Shuo Chen, Mahesh K. Marina
Fourier-feature-based positional encoding (PE) is commonly used in machine learning tasks that involve learning high-frequency features from low-dimensional inputs, such as 3D view synthesis and time series regression with neural tangent kernels.
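For context, a minimal random Fourier-feature PE sketch in PyTorch; the frequency scale `sigma` and mapping size are illustrative assumptions, not values from the paper.

```python
import torch

class FourierFeaturePE(torch.nn.Module):
    """Random Fourier-feature positional encoding: x -> [sin(2*pi*Bx), cos(2*pi*Bx)]."""
    def __init__(self, in_dim: int, mapping_size: int = 256, sigma: float = 10.0):
        super().__init__()
        # Fixed random frequency matrix B sampled from N(0, sigma^2)
        self.register_buffer("B", torch.randn(in_dim, mapping_size) * sigma)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        proj = 2 * torch.pi * x @ self.B                      # [..., mapping_size]
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# pe = FourierFeaturePE(in_dim=3); features = pe(torch.rand(1024, 3))
```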
no code implementations • 12 Jun 2024 • Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang
Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators.
no code implementations • 29 May 2024 • Sifan Zhou, Zhihang Yuan, Dawei Yang, Xubin Wen, Xing Hu, Yuguang Shi, Ziyu Zhao, Xiaobo Lu
To address the above issue, we first unveil the importance of different input information during PFE and identify the height dimension as a key factor in enhancing 3D detection performance.
no code implementations • 28 May 2024 • Xing Hu, Yuan Cheng, Dawei Yang, Zhihang Yuan, Jiangyong Yu, Chen Xu, Sifan Zhou
Post-training quantization (PTQ) serves as a potent technique to accelerate the inference of large language models (LLMs).
2 code implementations • 27 May 2024 • Kai Wang, Mingjia Shi, Yukun Zhou, Zekai Li, Zhihang Yuan, Yuzhang Shang, Xiaojiang Peng, Hanwang Zhang, Yang You
Training diffusion models is a computation-intensive task.
no code implementations • 10 May 2024 • Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
Large language models (LLMs) can now handle longer sequences of tokens, enabling complex tasks like book understanding and generating lengthy novels.
no code implementations • 22 Apr 2024 • Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang
This paper presents a comprehensive survey of the existing literature on efficient LLM inference.
1 code implementation • 11 Apr 2024 • Weisheng Xu, Sifan Zhou, Zhihang Yuan
LiDAR-based 3D single object tracking (3D SOT) is a critical issue in robotics and autonomous driving.
2 code implementations • 26 Feb 2024 • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model for systematic analysis of LLM inference techniques.
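As a quick illustration of the roofline model itself (not the survey's framework), attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity:

```python
def roofline_bound(flops: float, bytes_moved: float,
                   peak_flops: float, peak_bandwidth: float) -> float:
    """Attainable throughput (FLOP/s) under the roofline model: an operator is
    memory-bound when its arithmetic intensity (FLOPs per byte moved) is below
    peak_flops / peak_bandwidth, and compute-bound otherwise."""
    intensity = flops / bytes_moved
    return min(peak_flops, peak_bandwidth * intensity)

# Illustrative numbers: a decode-time GEMV doing 2e9 FLOPs while moving 2e9 bytes
# has intensity 1 FLOP/byte; on a 1 TB/s, 300 TFLOP/s device the bound is 1e12
# FLOP/s, i.e. the kernel is memory-bound by a wide margin.
```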
no code implementations • 19 Feb 2024 • Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
1 code implementation • 6 Feb 2024 • Haoxuan Wang, Yuzhang Shang, Zhihang Yuan, Junyi Wu, Junchi Yan, Yan Yan
We empirically verify that our approach modifies the activation distribution and provides meaningful temporal information, facilitating easier and more accurate quantization.
1 code implementation • NeurIPS 2023 • Yuzhang Shang, Zhihang Yuan, Yan Yan
Thus, we introduce mutual information (MI) as the metric to quantify the information shared between the synthetic and real datasets, and devise MIM4DD, which numerically maximizes this MI via a newly designed optimizable objective within a contrastive learning framework to update the synthetic dataset.
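The snippet below is a generic InfoNCE-style contrastive lower bound on MI between paired synthetic and real features, shown only to illustrate the kind of objective such frameworks build on; it is not the MIM4DD objective itself.

```python
import torch
import torch.nn.functional as F

def info_nce(z_syn: torch.Tensor, z_real: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE loss: minimizing it tightens a lower bound on the mutual
    information between paired synthetic and real features (generic sketch)."""
    z_syn = F.normalize(z_syn, dim=-1)
    z_real = F.normalize(z_real, dim=-1)
    logits = z_syn @ z_real.T / tau                 # [N, N] similarity matrix
    targets = torch.arange(z_syn.size(0), device=z_syn.device)
    return F.cross_entropy(logits, targets)         # positives on the diagonal
```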
1 code implementation • 17 Dec 2023 • Dawei Yang, Ning He, Xing Hu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang
Although neural networks have made remarkable advancements in various applications, they require substantial computational and memory resources.
1 code implementation • 10 Dec 2023 • Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, Guangyu Sun
Based on the success of the low-rank decomposition of projection matrices in the self-attention module, we further introduce ASVD to compress the KV cache.
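A minimal sketch of plain SVD-based low-rank factorization of a projection matrix; ASVD itself is activation-aware, which this plain-SVD example omits, and the rank is an assumed hyperparameter.

```python
import torch

def low_rank_factorize(linear: torch.nn.Linear, rank: int) -> torch.nn.Sequential:
    """Replace a Linear layer's weight W with a rank-`rank` approximation
    W ~= (U * S) @ Vh, implemented as two smaller Linear layers (generic sketch)."""
    W = linear.weight.data                            # [out_features, in_features]
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    down = torch.nn.Linear(W.shape[1], rank, bias=False)
    up = torch.nn.Linear(rank, W.shape[0], bias=linear.bias is not None)
    down.weight.data = Vh[:rank]                      # [rank, in_features]
    up.weight.data = U[:, :rank] * S[:rank]           # [out_features, rank]
    if linear.bias is not None:
        up.bias.data = linear.bias.data
    return torch.nn.Sequential(down, up)              # drop-in replacement
```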
2 code implementations • 29 Sep 2023 • Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong
This paper explores network binarization, a radical form of quantization that compresses model weights to a single bit, specifically for compressing Large Language Models (LLMs).
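For intuition, a classic 1-bit weight quantization sketch (sign plus a per-row scale); this is a generic baseline for illustration, not the method proposed in the paper.

```python
import torch

def binarize_weights(W: torch.Tensor):
    """Row-wise 1-bit weight quantization: W ~= alpha * sign(W), where alpha is
    the mean absolute value of each output row (generic sketch)."""
    alpha = W.abs().mean(dim=1, keepdim=True)        # [out_features, 1] scales
    B = torch.sign(W)                                # entries in {-1, +1} (0 maps to 0)
    return alpha, B                                  # dequantize with alpha * B

# W = torch.randn(4096, 4096); alpha, B = binarize_weights(W); W_hat = alpha * B
```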
1 code implementation • 30 Aug 2023 • Yizeng Han, Zeyu Liu, Zhihang Yuan, Yifan Pu, Chaofei Wang, Shiji Song, Gao Huang
Dynamic computation has emerged as a promising avenue to enhance the inference efficiency of deep networks.
no code implementations • 19 Apr 2023 • Lin Niu, Jiawei Liu, Zhihang Yuan, Dawei Yang, Xinggang Wang, Wenyu Liu
PTQ optimizes the quantization parameters under different metrics to minimize the perturbation introduced by quantization.
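A minimal sketch of one such metric, assuming a symmetric uniform quantizer and a hypothetical `search_scale` helper: the clipping scale is grid-searched to minimize the MSE perturbation on a single tensor.

```python
import torch

def search_scale(x: torch.Tensor, n_bits: int = 8, grid: int = 100) -> float:
    """Grid-search a symmetric clipping scale that minimizes the MSE between a
    tensor and its quantized reconstruction (illustrative PTQ sketch)."""
    qmax = 2 ** (n_bits - 1) - 1
    best_scale, best_err = None, float("inf")
    for i in range(1, grid + 1):
        clip = x.abs().max() * i / grid              # candidate clipping threshold
        scale = clip / qmax
        x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        err = torch.mean((x - x_q) ** 2).item()      # perturbation metric (MSE)
        if err < best_err:
            best_scale, best_err = scale.item(), err
    return best_scale
```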
1 code implementation • 3 Apr 2023 • Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, Bingzhe Wu
In this paper, we identify that the challenge in quantizing activations in LLMs arises from varying ranges across channels, rather than solely the presence of outliers.
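A toy sketch of range-aware channel grouping (illustrative only, not the paper's exact algorithm): channels with similar value ranges are grouped together and each group is quantized with its own scale.

```python
import torch

def group_channels_by_range(act: torch.Tensor, n_groups: int = 4):
    """Cluster activation channels by their value range and assign one INT8
    scale per group; `act` is laid out as [num_tokens, num_channels] (sketch)."""
    ranges = act.max(dim=0).values - act.min(dim=0).values   # per-channel range
    order = torch.argsort(ranges)                            # similar ranges adjacent
    groups = torch.chunk(order, n_groups)                    # channel index groups
    scales = {g: act[:, idx].abs().max() / 127 for g, idx in enumerate(groups)}
    return groups, scales
```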
no code implementations • 23 Mar 2023 • Zhihang Yuan, Jiawei Liu, Jiaxiang Wu, Dawei Yang, Qiang Wu, Guangyu Sun, Wenyu Liu, Xinggang Wang, Bingzhe Wu
Post-training quantization (PTQ) is a popular method for compressing deep neural networks (DNNs) without modifying their original architecture or training procedures.
1 code implementation • CVPR 2023 • Jiawei Liu, Lin Niu, Zhihang Yuan, Dawei Yang, Xinggang Wang, Wenyu Liu
It determines the quantization parameters using the difference between the network's predictions before and after quantization.
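A hedged sketch of the general idea, assuming hypothetical `fp_model`, `q_model`, and `calib_loader` objects: the calibration objective measures how far the quantized network's predictions drift from the full-precision ones.

```python
import torch
import torch.nn.functional as F

def prediction_difference(fp_model, q_model, calib_loader) -> float:
    """Sum the KL divergence between full-precision and quantized predictions
    over a calibration set (generic sketch of a prediction-difference metric)."""
    total = 0.0
    for x, _ in calib_loader:
        with torch.no_grad():
            p_fp = F.log_softmax(fp_model(x), dim=-1)
            p_q = F.log_softmax(q_model(x), dim=-1)
        total += F.kl_div(p_q, p_fp, log_target=True, reduction="batchmean").item()
    return total
```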
1 code implementation • CVPR 2023 • Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, Yan Yan
These approaches define a forward diffusion process for transforming data into noise and a backward denoising process for sampling data from noise.
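For reference, the standard closed-form forward (noising) step used by such models, written as a small sketch; this is the textbook DDPM formulation, shown only to illustrate the two processes, not the paper's quantization method.

```python
import torch

def forward_diffuse(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    """Forward diffusion in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over the batch
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1 - a) * eps
    return x_t, eps                                      # eps is the denoiser's target
```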
2 code implementations • 12 Oct 2022 • Yizeng Han, Zhihang Yuan, Yifan Pu, Chenhao Xue, Shiji Song, Guangyu Sun, Gao Huang
The latency prediction model can efficiently estimate the inference latency of dynamic networks by simultaneously considering algorithms, scheduling strategies, and hardware properties.
1 code implementation • 24 Nov 2021 • Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, Guangyu Sun
We observe that the distributions of activation values after the softmax and GELU functions differ markedly from a Gaussian distribution.
no code implementations • 15 Oct 2021 • Zhihang Yuan, Yiqi Chen, Chenhao Xue, Chenguang Zhang, Qiankun Wang, Guangyu Sun
Network quantization is a powerful technique to compress convolutional neural networks.
no code implementations • 19 Sep 2020 • Zhihang Yuan, Xin Liu, Bingzhe Wu, Guangyu Sun
The inference of an input sample can exit at an early stage if that stage's prediction is confident enough.
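A minimal early-exit sketch, assuming hypothetical `stages` and `classifiers` module lists and an illustrative confidence threshold:

```python
import torch
import torch.nn.functional as F

def early_exit_inference(x: torch.Tensor, stages, classifiers, threshold: float = 0.9):
    """Run a multi-stage network on a single sample and return as soon as one
    stage's classifier is confident enough (generic early-exit sketch)."""
    for stage, clf in zip(stages, classifiers):
        x = stage(x)                                  # features from this stage
        probs = F.softmax(clf(x), dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:                  # confident enough: exit early
            return pred, conf
    return pred, conf                                 # otherwise use the final stage
```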
no code implementations • 16 Nov 2019 • Zhihang Yuan, Bingzhe Wu, Zheng Liang, Shiwan Zhao, Weichen Bi, Guangyu Sun
Recently, dynamic inference has emerged as a promising way to reduce the computational cost of deep convolutional neural networks (CNNs).