Search Results for author: Xincheng Wang

Found 1 paper, 1 paper with code

QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

1 code implementation • 13 Oct 2023 • Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy.
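To make "weights and activations cast to 4 bits" concrete, here is a minimal sketch of generic symmetric 4-bit quantization applied to a single dot product. It is an illustration only, not the QUIK method described in the paper: the function names `quantize_int4` and `int4_dot`, the per-tensor scaling, and the sample values are all assumptions made for this example.

```python
def quantize_int4(values):
    """Symmetric per-tensor quantization to the signed 4-bit range [-8, 7].

    Hypothetical helper for illustration; returns integer codes and the scale.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 7; fall back to 1.0 for an all-zero input.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def int4_dot(weights, activations):
    """Dot product where both operands are quantized to 4 bits first."""
    w_codes, w_scale = quantize_int4(weights)
    a_codes, a_scale = quantize_int4(activations)
    # Accumulate in integers, then rescale once to recover a float result.
    acc = sum(w * a for w, a in zip(w_codes, a_codes))
    return acc * w_scale * a_scale


# Made-up sample data, chosen only to show the approximation error.
weights = [0.5, -1.0, 0.25, 0.75]
activations = [1.0, 2.0, -0.5, 0.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = int4_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The integer accumulation followed by a single rescale is what lets low-bit arithmetic replace floating-point multiplies; how close the result stays to the exact value depends on the data and on the quantization scheme, which is where a method such as the one in the paper would differ from this sketch.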

Computational Efficiency, Quantization

