Search Results for author: Zhihang Yuan

Found 21 papers, 13 papers with code

S2DNAS: Transforming Static CNN Model for Dynamic Inference via Neural Architecture Search

no code implementations ECCV 2020 Zhihang Yuan, Bingzhe Wu, Guangyu Sun, Zheng Liang, Shiwan Zhao, Weichen Bi

To this end, based on a given CNN model, we first generate a CNN architecture space in which each architecture is a multi-stage CNN generated from the given model using some predefined transformations.

Neural Architecture Search

LLM Inference Unveiled: Survey and Roofline Model Insights

2 code implementations 26 Feb 2024 Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer

Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model for the systematic analysis of LLM inference techniques.

Knowledge Distillation Language Modelling +3
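
As a rough illustration of the kind of roofline analysis the survey builds on, the Python sketch below bounds a kernel's execution time by the larger of its compute term and its memory-traffic term. The kernel shape and the peak-throughput and bandwidth figures are hypothetical placeholders chosen for illustration, not numbers taken from the paper.

def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Lower-bound kernel time under a simple roofline model:
    the kernel is limited by either compute or memory traffic."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Hypothetical single-token decode matmul against a 4096x4096 fp16 weight
# (batch size 1), which lands firmly in the memory-bound regime.
flops = 2 * 4096 * 4096          # one multiply and one add per weight
bytes_moved = 2 * 4096 * 4096    # fp16 weight bytes dominate the traffic
peak_flops = 312e12              # assumed accelerator fp16 peak (FLOP/s)
peak_bw = 2.0e12                 # assumed memory bandwidth (bytes/s)

intensity = flops / bytes_moved  # arithmetic intensity in FLOP/byte
t = roofline_time(flops, bytes_moved, peak_flops, peak_bw)
print(f"arithmetic intensity = {intensity:.1f} FLOP/B, time >= {t * 1e6:.2f} us")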

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

no code implementations 19 Feb 2024 Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie

Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.

Quantization Text Generation

QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning

1 code implementation 6 Feb 2024 Haoxuan Wang, Yuzhang Shang, Zhihang Yuan, Junyi Wu, Yan Yan

Diffusion models have achieved remarkable success in image generation tasks, yet their practical deployment is constrained by high memory and time consumption.

Image Generation Model Compression +1

MIM4DD: Mutual Information Maximization for Dataset Distillation

1 code implementation NeurIPS 2023 Yuzhang Shang, Zhihang Yuan, Yan Yan

Thus, we introduce mutual information (MI) as the metric to quantify the information shared between the synthetic and real datasets, and devise MIM4DD to numerically maximize the MI via a newly designed optimizable objective within a contrastive learning framework, updating the synthetic dataset.

Contrastive Learning
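
The excerpt above mentions maximizing mutual information through a contrastive objective. A generic InfoNCE-style loss, whose minimization maximizes a lower bound on the MI between matched synthetic/real feature pairs, is sketched below; it is only that generic form, not the MIM4DD objective itself, and the batch size, feature dimension, and temperature are arbitrary.

import numpy as np

def info_nce(synthetic_feats, real_feats, temperature=0.1):
    # Matched synthetic/real pairs are positives; every other pairing in the
    # batch is a negative. Minimizing this loss maximizes a lower bound on
    # the mutual information between the two feature sets.
    s = synthetic_feats / np.linalg.norm(synthetic_feats, axis=1, keepdims=True)
    r = real_feats / np.linalg.norm(real_feats, axis=1, keepdims=True)
    logits = s @ r.T / temperature                  # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # positives sit on the diagonal

rng = np.random.default_rng(0)
loss = info_nce(rng.normal(size=(8, 128)), rng.normal(size=(8, 128)))
print(f"contrastive loss on random features: {loss:.3f}")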

Post-Training Quantization for Re-parameterization via Coarse & Fine Weight Splitting

1 code implementation 17 Dec 2023 Dawei Yang, Ning He, Xing Hu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang

Although neural networks have made remarkable advancements in various applications, they require substantial computational and memory resources.

Quantization

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

1 code implementation 10 Dec 2023 Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, Guangyu Sun

This paper explores a new post-hoc training-free compression paradigm for compressing Large Language Models (LLMs) to facilitate their wider adoption in various computing environments.
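
One plausible way to picture "activation-aware" low-rank compression is sketched below: scale each input channel of a weight matrix by a proxy of its typical activation magnitude before the SVD, truncate, and fold the scaling back into the factors. The scaling scheme, the calibration statistic, and the chosen rank here are assumptions made for illustration, not the paper's actual ASVD procedure.

import numpy as np

def low_rank_compress(w, act_scale, rank):
    # Weight the SVD by a per-input-channel activation statistic so that the
    # truncation error is biased toward channels that matter at inference,
    # then fold the scaling back so that w is approximated by a @ b.
    s = np.diag(act_scale)
    u, sigma, vt = np.linalg.svd(w @ s, full_matrices=False)
    a = u[:, :rank] * sigma[:rank]          # (out_dim, rank)
    b = vt[:rank] @ np.linalg.inv(s)        # (rank, in_dim), scaling undone
    return a, b

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64))
act_scale = np.abs(rng.normal(size=64)) + 0.1   # stand-in for calibration stats
a, b = low_rank_compress(w, act_scale, rank=16)
print(f"relative reconstruction error: {np.linalg.norm(w - a @ b) / np.linalg.norm(w):.3f}")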

PB-LLM: Partially Binarized Large Language Models

2 code implementations 29 Sep 2023 Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong

This paper explores network binarization, a radical form of quantization that compresses model weights to a single bit, specifically for the compression of Large Language Models (LLMs).

Binarization Quantization
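
To make "partial binarization" concrete, the toy sketch below keeps a small fraction of large-magnitude weights at full precision and replaces the remainder with sign(w) times a shared scaling factor. The salient-weight criterion, the fraction kept, and the per-tensor scale are illustrative assumptions, not PB-LLM's actual scheme.

import numpy as np

def partially_binarize(w, salient_frac=0.1):
    # Keep the largest-magnitude weights untouched and map everything else to
    # a 1-bit representation: sign(w) scaled by the mean magnitude of the
    # binarized set.
    flat = np.abs(w).ravel()
    k = max(1, int(salient_frac * flat.size))
    thresh = np.partition(flat, -k)[-k]          # magnitude cutoff for salient weights
    salient = np.abs(w) >= thresh
    alpha = np.abs(w[~salient]).mean()           # shared scaling factor
    return np.where(salient, w, alpha * np.sign(w))

w = np.random.default_rng(2).normal(size=(4, 8))
print(np.round(partially_binarize(w), 3))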

Latency-aware Unified Dynamic Networks for Efficient Image Recognition

1 code implementation 30 Aug 2023 Yizeng Han, Zeyu Liu, Zhihang Yuan, Yifan Pu, Chaofei Wang, Shiji Song, Gao Huang

Dynamic computation has emerged as a promising avenue to enhance the inference efficiency of deep networks.

Scheduling

RPTQ: Reorder-based Post-training Quantization for Large Language Models

1 code implementation 3 Apr 2023 Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, Bingzhe Wu

In this paper, we identify that the challenge in quantizing activations in LLMs arises from varying ranges across channels, rather than solely the presence of outliers.

Quantization
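
A minimal way to visualize that varying-range observation is sketched below: sort channels by their value range, split them into groups, and give each group its own quantization scale instead of one per-tensor scale. The even split used here is a simple stand-in for RPTQ's actual reordering and clustering.

import numpy as np

def grouped_fake_quantize(x, n_groups=4, n_bits=8):
    # Channels with similar ranges share a quantization scale, so a few
    # wide-range channels no longer dictate the scale of every other channel.
    ranges = x.max(axis=0) - x.min(axis=0)          # per-channel value range
    order = np.argsort(ranges)                      # reorder channels by range
    qmax = 2 ** (n_bits - 1) - 1
    out = np.empty_like(x)
    for group in np.array_split(order, n_groups):
        scale = np.abs(x[:, group]).max() / qmax    # one scale per group
        out[:, group] = np.round(x[:, group] / scale) * scale
    return out

rng = np.random.default_rng(3)
x = rng.normal(size=(16, 32)) * np.linspace(0.1, 10.0, 32)   # uneven channel ranges
print(f"mean abs error: {np.abs(grouped_fake_quantize(x) - x).mean():.4f}")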

Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance

no code implementations 23 Mar 2023 Zhihang Yuan, Jiawei Liu, Jiaxiang Wu, Dawei Yang, Qiang Wu, Guangyu Sun, Wenyu Liu, Xinggang Wang, Bingzhe Wu

Post-training quantization (PTQ) is a popular method for compressing deep neural networks (DNNs) without modifying their original architecture or training procedures.

Benchmarking Data Augmentation +1

PD-Quant: Post-Training Quantization based on Prediction Difference Metric

1 code implementation CVPR 2023 Jiawei Liu, Lin Niu, Zhihang Yuan, Dawei Yang, Xinggang Wang, Wenyu Liu

It determines the quantization parameters by using the difference between the network's predictions before and after quantization.

Neural Network Compression Quantization
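
The sketch below illustrates the general idea of selecting quantization parameters by prediction difference rather than by weight rounding error, using a single matmul as a stand-in for the network's prediction. The candidate clip ratios and this one-layer proxy are assumptions; PD-Quant itself works with the full network output and additional regularization.

import numpy as np

def scale_by_prediction_difference(w, x, n_bits=8, n_candidates=20):
    # Try several clipping ratios and keep the one whose quantized output is
    # closest to the full-precision output, instead of the one that best
    # reconstructs the weights themselves.
    qmax = 2 ** (n_bits - 1) - 1
    y_fp = x @ w                                    # full-precision "prediction"
    best_scale, best_err = None, np.inf
    for clip in np.linspace(0.5, 1.0, n_candidates):
        scale = clip * np.abs(w).max() / qmax
        w_q = np.clip(np.round(w / scale), -qmax, qmax) * scale
        err = np.mean((x @ w_q - y_fp) ** 2)        # prediction difference
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err

rng = np.random.default_rng(4)
scale, err = scale_by_prediction_difference(rng.normal(size=(64, 16)),
                                            rng.normal(size=(8, 64)))
print(f"chosen scale = {scale:.4f}, prediction MSE = {err:.6f}")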

Post-training Quantization on Diffusion Models

1 code implementation CVPR 2023 Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, Yan Yan

These approaches define a forward diffusion process for transforming data into noise and a backward denoising process for sampling data from noise.

Denoising Noise Estimation +1
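
For reference, the closed-form forward (noising) process that such samplers build on can be written in a few lines; the sketch below shows only that textbook process with a conventional linear beta schedule, not the paper's post-training quantization method.

import numpy as np

def forward_diffuse(x0, t, betas, rng):
    # Variance-preserving forward process sampled in closed form:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

rng = np.random.default_rng(5)
betas = np.linspace(1e-4, 0.02, 1000)       # common linear schedule
x0 = rng.normal(size=(3, 32, 32))           # stand-in for an image tensor
xt, eps = forward_diffuse(x0, t=500, betas=betas, rng=rng)
print(f"remaining signal fraction at t=500: {np.sqrt(np.cumprod(1.0 - betas)[500]):.3f}")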

Latency-aware Spatial-wise Dynamic Networks

2 code implementations 12 Oct 2022 Yizeng Han, Zhihang Yuan, Yifan Pu, Chenhao Xue, Shiji Song, Guangyu Sun, Gao Huang

The latency prediction model can efficiently estimate the inference latency of dynamic networks by simultaneously considering algorithms, scheduling strategies, and hardware properties.

Image Classification Instance Segmentation +4

PTQ4ViT: Post-Training Quantization Framework for Vision Transformers with Twin Uniform Quantization

1 code implementation 24 Nov 2021 Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, Guangyu Sun

We observe that the distributions of activation values after the softmax and GELU functions are quite different from the Gaussian distribution.

Quantization
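
To see why a single uniform range fits such skewed, non-Gaussian activations poorly, the toy sketch below splits values into two regions and gives each its own uniform scale. The median split is an arbitrary stand-in; PTQ4ViT's twin uniform quantization chooses its two ranges quite differently.

import numpy as np

def two_region_quantize(x, n_bits=8):
    # Give the small-magnitude and large-magnitude parts of a skewed
    # distribution separate uniform scales, instead of one shared scale.
    qmax = 2 ** (n_bits - 1) - 1
    split = np.median(np.abs(x))
    out = np.empty_like(x)
    for mask in (np.abs(x) <= split, np.abs(x) > split):
        scale = max(float(np.abs(x[mask]).max()), 1e-12) / qmax
        out[mask] = np.round(x[mask] / scale) * scale
    return out

rng = np.random.default_rng(6)
attn = np.exp(rng.normal(size=(4, 64)))
attn /= attn.sum(axis=1, keepdims=True)     # post-softmax values, skewed toward 0
print(f"mean abs error: {np.abs(two_region_quantize(attn) - attn).mean():.6f}")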

ENAS4D: Efficient Multi-stage CNN Architecture Search for Dynamic Inference

no code implementations 19 Sep 2020 Zhihang Yuan, Xin Liu, Bingzhe Wu, Guangyu Sun

The inference of an input sample can exit from an early stage if that stage's prediction is confident enough.
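
A minimal sketch of confidence-based early exit, assuming a hypothetical stack of stages each followed by its own classifier: inference stops at the first stage whose softmax confidence clears a threshold. The tiny random "network" and the threshold value are purely illustrative.

import numpy as np

def multi_stage_predict(x, stages, classifiers, threshold=0.9):
    # Run stages in order and exit as soon as one stage's classifier is
    # confident enough; the last stage always produces the answer.
    feat = x
    for i, (stage, clf) in enumerate(zip(stages, classifiers)):
        feat = stage(feat)
        logits = clf(feat)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        if probs.max() >= threshold or i == len(stages) - 1:
            return int(np.argmax(probs)), i          # predicted class, exit stage

rng = np.random.default_rng(7)
stages = [lambda f, W=rng.normal(size=(16, 16)): np.tanh(f @ W) for _ in range(3)]
classifiers = [lambda f, W=rng.normal(size=(16, 4)): f @ W for _ in range(3)]
pred, exit_stage = multi_stage_predict(rng.normal(size=16), stages, classifiers)
print(f"predicted class {pred}, exited after stage {exit_stage}")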

S2DNAS: Transforming Static CNN Model for Dynamic Inference via Neural Architecture Search

no code implementations 16 Nov 2019 Zhihang Yuan, Bingzhe Wu, Zheng Liang, Shiwan Zhao, Weichen Bi, Guangyu Sun

Recently, dynamic inference has emerged as a promising way to reduce the computational cost of deep convolutional neural networks (CNNs).

Neural Architecture Search
