2 code implementations • 21 Aug 2024 • Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, Dan Alistarh
As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, weight quantization has become a standard technique for efficient GPU deployment.
1 code implementation • 11 Jan 2024 • Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh
The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices.
1 code implementation • 25 Oct 2023 • Elias Frantar, Dan Alistarh
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts.
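To make the sparse-routing idea concrete, here is a minimal numpy sketch of top-k expert routing. The softmax gate, ReLU experts, and the names `gate_W`, `experts`, and `top_k` are illustrative assumptions, not the specific MoE design used in the paper; the point is only that each token touches the weights of a small number of experts.

```python
import numpy as np

def moe_forward(x, gate_W, experts, top_k=2):
    """Minimal sparse-MoE layer: route each token to its top_k experts only.
    x: (tokens, d); gate_W: (d, num_experts); experts: list of (W, b) pairs."""
    logits = x @ gate_W                                   # (tokens, num_experts)
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]      # indices of selected experts
    sel = np.take_along_axis(logits, chosen, axis=-1)
    gates = np.exp(sel - sel.max(axis=-1, keepdims=True)) # softmax over selected only
    gates /= gates.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                           # per-token dispatch
        for s in range(top_k):
            W, b = experts[chosen[t, s]]
            out[t] += gates[t, s] * np.maximum(x[t] @ W + b, 0.0)  # ReLU expert
    return out
```

Because only `top_k` experts are evaluated per token, compute per token stays roughly constant as experts (and thus parameters) are added, which is exactly the trade-off described above.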
1 code implementation • 13 Oct 2023 • Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh
We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations cast to 4 bits, in a way that leads to practical speedups while maintaining good accuracy.
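As a rough illustration of why casting both weights and activations to 4 bits enables speedups, the sketch below quantizes both operands symmetrically, runs the matrix multiplication in integer arithmetic, and rescales once at the end. The per-tensor scales and plain round-to-nearest scheme are assumptions made for brevity, not the paper's method.

```python
import numpy as np

def quantize_sym(x, bits=4):
    """Symmetric round-to-nearest quantization to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit
    scale = np.max(np.abs(x)) / qmax + 1e-12          # per-tensor scale (simplification)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def int4_matmul(A, W):
    """Quantize activations A and weights W to 4 bits, multiply in integers,
    then rescale the integer accumulator back to floating point once."""
    qa, sa = quantize_sym(A)
    qw, sw = quantize_sym(W)
    acc = qa @ qw                                     # integer accumulation
    return acc.astype(np.float32) * (sa * sw)
```

On actual GPUs the integer product maps onto low-precision tensor-core instructions, which is where the speedup comes from; numpy here only mimics the arithmetic.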
2 code implementations • 10 Oct 2023 • Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh
While the standard approach is to leverage sparsity to reduce computation, we observe that for memory-bound LLMs sparsity can also be leveraged to reduce memory bandwidth requirements.
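A back-of-the-envelope sketch of the bandwidth argument, under assumed numbers (a 4096x4096 fp16 layer at 75% unstructured sparsity, CSR-style storage with 16-bit column indices): the compressed representation roughly halves the bytes that must be streamed from memory per matrix-vector product, which is what matters in the memory-bound generation regime.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float16)
W[rng.random(W.shape) < 0.75] = 0.0        # 75% unstructured sparsity

dense_bytes = W.nbytes                     # every element is read in a dense matvec
nnz = np.count_nonzero(W)
# CSR-style storage: one fp16 value + one int16 column index per nonzero,
# plus one int32 row pointer per row
sparse_bytes = nnz * (2 + 2) + (W.shape[0] + 1) * 4

print(f"dense: {dense_bytes / 2**20:.1f} MiB, sparse: {sparse_bytes / 2**20:.1f} MiB")
```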
1 code implementation • 6 Oct 2023 • Arshia Soltani Moakhar, Eugenia Iofinova, Elias Frantar, Dan Alistarh
In this paper, we demonstrate, for the first time, that sparsity can instead be incorporated into the interpretation process itself, as a sample-specific preprocessing step.
no code implementations • 15 Sep 2023 • Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains.
no code implementations • 3 Aug 2023 • Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, Dan Alistarh
Obtaining versions of deep neural networks that are both highly accurate and highly sparse is one of the main challenges in model compression, and several high-performance pruning techniques have been investigated by the community.
1 code implementation • 7 Jul 2023 • Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel
We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs.
1 code implementation • 9 Jun 2023 • Ionut-Vlad Modoranu, Aleksei Kalinov, Eldar Kurtic, Elias Frantar, Dan Alistarh
Experiments on deep neural networks show that this approach can compress full-matrix preconditioners to up to 99% sparsity without accuracy loss, effectively removing the memory overhead of full-matrix preconditioners such as GGT and M-FAC.
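The error-feedback mechanism behind this can be sketched generically: keep only the top fraction of entries of each update and carry the discarded remainder into the next step, so nothing is lost in the long run. The toy stream of dense updates and the 1% keep ratio below are assumptions; the paper applies the idea to full-matrix preconditioner statistics, not to this toy stream.

```python
import numpy as np

def topk_sparsify(x, keep_frac=0.01):
    """Keep the largest-magnitude keep_frac of entries, zero out the rest."""
    k = max(1, int(keep_frac * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    return np.where(np.abs(x) >= thresh, x, 0.0)

def compress_stream_with_error_feedback(updates, keep_frac=0.01):
    """Sparsify a stream of updates to ~99% zeros, re-injecting dropped mass."""
    error = np.zeros_like(updates[0])
    compressed = []
    for u in updates:
        corrected = u + error                 # add back what was dropped previously
        sparse = topk_sparsify(corrected, keep_frac)
        error = corrected - sparse            # remember what was dropped this round
        compressed.append(sparse)
    return compressed
```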
1 code implementation • 5 Jun 2023 • Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh
Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities.
1 code implementation • 27 Apr 2023 • Joo Hyung Lee, Wonpyo Park, Nicole Mitchell, Jonathan Pilault, Johan Obando-Ceron, Han-Byul Kim, Namhoon Lee, Elias Frantar, Yun Long, Amir Yazdanbakhsh, Shivani Agrawal, Suvinay Subramanian, Xin Wang, Sheng-Chun Kao, Xingyao Zhang, Trevor Gale, Aart Bik, Woohyun Han, Milen Ferev, Zhonglin Han, Hong-Seok Kim, Yann Dauphin, Gintare Karolina Dziugaite, Pablo Samuel Castro, Utku Evci
This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research.
no code implementations • 25 Mar 2023 • Denis Kuznedelev, Soroush Tabesh, Kimia Noorbakhsh, Elias Frantar, Sara Beery, Eldar Kurtic, Dan Alistarh
To address this, we ask: can we quickly compress large generalist models into accurate and efficient specialists?
1 code implementation • NeurIPS 2023 • Eldar Kurtic, Elias Frantar, Dan Alistarh
Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications.
5 code implementations • 2 Jan 2023 • Elias Frantar, Dan Alistarh
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one shot, without any retraining, at minimal loss of accuracy; a simple one-shot pruning sketch follows after this entry.
Ranked #1 on Language Modelling on WikiText-2 (using extra training data)
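For orientation, here is a minimal sketch of one-shot unstructured pruning via per-layer magnitude selection, i.e., the kind of simple baseline such methods are usually compared against. It is not SparseGPT's approach, which uses calibration data and approximate second-order information to choose and compensate for the removed weights.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """One-shot unstructured pruning: zero the smallest-magnitude weights per layer.
    weights: dict of layer name -> array; no retraining or calibration involved."""
    pruned = {}
    for name, W in weights.items():
        k = int(sparsity * W.size)
        thresh = np.partition(np.abs(W).ravel(), k)[k]   # per-layer magnitude threshold
        pruned[name] = np.where(np.abs(W) < thresh, 0.0, W)
    return pruned
```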
16 code implementations • 31 Oct 2022 • Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
In this paper, we address this challenge and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient.
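A stripped-down numpy sketch of the column-wise scheme: build a Hessian proxy from calibration inputs, quantize one weight column at a time, and push each column's quantization error onto the not-yet-quantized columns using the iteratively updated inverse Hessian. The 4-bit symmetric grid, per-row scales, dampening constant, and plain matrix inverse are simplifications; the released implementation computes these quantities far more efficiently.

```python
import numpy as np

def quantize_rtn(x, scale, qmax=7):
    """Round-to-nearest onto a symmetric 4-bit grid with a given scale."""
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def gptq_like(W, X, damp=0.01):
    """Simplified column-wise quantization with second-order error compensation.
    W: (out, d) layer weights; X: (d, n) calibration inputs for this layer."""
    W = W.astype(np.float64)
    d = W.shape[1]
    H = X @ X.T                                          # Hessian proxy, (d, d)
    H += damp * np.mean(np.diag(H)) * np.eye(d)          # dampening for stability
    Hinv = np.linalg.inv(H)
    scale = np.max(np.abs(W), axis=1, keepdims=True) / 7 # per-row (per-output) scale
    Q = np.zeros_like(W)
    for j in range(d):
        Q[:, j] = quantize_rtn(W[:, j], scale[:, 0])
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # spread this column's quantization error onto the remaining columns
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
        # drop column j from the inverse Hessian of the remaining weights
        Hinv[j + 1:, j + 1:] -= np.outer(Hinv[j + 1:, j], Hinv[j, j + 1:]) / Hinv[j, j]
    return Q
```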
1 code implementation • 31 Oct 2022 • Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, Dan Alistarh
Data-parallel distributed training of deep neural networks (DNNs) has gained widespread adoption, but can still suffer from communication bottlenecks.
no code implementations • NeurIPS 2023 • Denis Kuznedelev, Eldar Kurtic, Elias Frantar, Dan Alistarh
To further showcase CAP's accuracy and scalability, we use it to show for the first time that extremely accurate large vision models, trained via self-supervised techniques, can also be pruned to moderate sparsities with negligible accuracy loss.
1 code implementation • 24 Aug 2022 • Elias Frantar, Sidak Pal Singh, Dan Alistarh
We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data.
2 code implementations • 14 Mar 2022 • Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh
We perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models.
1 code implementation • 31 Jan 2022 • Elias Frantar, Dan Alistarh
The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular.
2 code implementations • NeurIPS 2021 • Elias Frantar, Eldar Kurtic, Dan Alistarh
We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian.
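The stated costs can be reproduced with a short Sherman-Morrison recursion: modeling the Hessian as $\lambda I$ plus a sum of $m$ rank-one terms $g_k g_k^T$, precomputing the partial solves $P[k] = H_{k-1}^{-1} g_k$ takes $O(dm^2)$, after which any IHVP $H^{-1}v$ costs $O(dm)$. The numpy sketch below is a naive illustration under these assumptions, not the optimized M-FAC implementation.

```python
import numpy as np

def precompute(G, lam=1e-3):
    """G: (m, d) gradients; Hessian proxy H = lam*I + sum_k g_k g_k^T.
    Returns partial solves P[k] = H_{k-1}^{-1} g_k and normalizers (O(d*m^2) work)."""
    m, d = G.shape
    P = np.zeros((m, d))
    denom = np.zeros(m)
    for k in range(m):
        p = G[k] / lam                              # start from H_0^{-1} g_k
        for j in range(k):                          # fold in the first k rank-one updates
            p -= (G[j] @ p) / denom[j] * P[j]
        P[k] = p
        denom[k] = 1.0 + G[k] @ p
    return P, denom

def ihvp(v, G, P, denom, lam=1e-3):
    """Inverse-Hessian-vector product H^{-1} v in O(d*m) via Sherman-Morrison."""
    x = v / lam
    for k in range(G.shape[0]):
        x -= (G[k] @ x) / denom[k] * P[k]
    return x
```

For small dimensions the result matches the dense reference `np.linalg.solve(lam * np.eye(d) + G.T @ G, v)`, while never materializing the $d \times d$ matrix.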
no code implementations • ICML 2020 • Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph H. Lampert
We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms.