Search Results for author: Elias Frantar

Found 21 papers, 16 papers with code

Extreme Compression of Large Language Models via Additive Quantization

1 code implementation • 11 Jan 2024 • Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh

The emergence of accurate open large language models (LLMs) has led to a race towards quantization techniques for such models enabling execution on end-user devices.

Quantization
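As a rough illustration of the additive-quantization idea named in the title (representing a group of weights as a sum of codewords drawn from several codebooks), here is a minimal greedy residual-quantization sketch. The codebooks are random and untrained, and the greedy search is far simpler than the learned codebooks and search procedure used in the actual method.

```python
# Minimal sketch of additive (residual) quantization: a weight group is
# approximated by a SUM of codewords, one per codebook. Illustrative only;
# the paper's method learns its codebooks and uses more sophisticated search.
import numpy as np

def greedy_additive_encode(x, codebooks):
    """Pick one codeword per codebook that best matches the running residual."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                      # cb: (num_codewords, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def additive_decode(codes, codebooks):
    """Reconstruct the group as the sum of the selected codewords."""
    return sum(cb[i] for i, cb in zip(codes, codebooks))

# Toy usage with random (untrained) codebooks.
rng = np.random.default_rng(0)
dim, n_books, n_words = 8, 2, 256
codebooks = [rng.normal(size=(n_words, dim)) for _ in range(n_books)]
x = rng.normal(size=dim)
codes = greedy_additive_encode(x, codebooks)
x_hat = additive_decode(codes, codebooks)
print(codes, np.linalg.norm(x - x_hat))
```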

QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

1 code implementation • 25 Oct 2023 • Elias Frantar, Dan Alistarh

Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts.
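For readers unfamiliar with MoE layers, the sketch below illustrates the sparse (top-k) routing mentioned above: only a few experts run per token, which is why parameter counts can grow far faster than per-token compute. Shapes and names are made up for illustration; this is not QMoE's compression scheme.

```python
# Minimal sketch of sparse (top-k) MoE routing: each token is processed by only
# the k best-scoring experts, weighted by a softmax over their gate scores.
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """x: (dim,) token; gate_w: (num_experts, dim); experts: list of weight matrices."""
    logits = gate_w @ x
    top = np.argsort(logits)[-k:]                 # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                      # softmax over the selected experts
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

rng = np.random.default_rng(1)
dim, n_experts = 16, 8
experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
gate_w = rng.normal(size=(n_experts, dim))
y = moe_forward(rng.normal(size=dim), gate_w, experts)
```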

QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

1 code implementation • 13 Oct 2023 • Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations cast to 4 bits, in a way that yields practical speedups while maintaining good accuracy.

Computational Efficiency • Quantization
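As a point of reference for what "casting to 4 bits" means, here is a minimal symmetric 4-bit quantize/dequantize round trip. The end-to-end method described in the paper involves much more than this (efficient kernels and careful treatment of the parts of the model that remain in higher precision), none of which is reproduced here.

```python
# Minimal symmetric 4-bit quantize/dequantize round trip, for illustration only.
import numpy as np

def quantize_sym_4bit(x):
    # Use the symmetric integer range [-7, 7] with a single per-tensor scale.
    scale = np.max(np.abs(x)) / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_sym_4bit(x)
print(np.abs(x - dequantize(q, s)).max())   # worst-case rounding error
```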

Sparse Fine-tuning for Inference Acceleration of Large Language Models

2 code implementations • 10 Oct 2023 • Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh

While the standard approach is to leverage sparsity for computational reduction, we observe that in the case of memory-bound LLMs sparsity can also be leveraged for reducing memory bandwidth.

Quantization • Text Generation • +1

Scaling Laws for Sparsely-Connected Foundation Models

no code implementations • 15 Sep 2023 • Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci

We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains.

Computational Efficiency

Accurate Neural Network Pruning Requires Rethinking Sparse Optimization

no code implementations • 3 Aug 2023 • Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, Dan Alistarh

Obtaining versions of deep neural networks that are both highly-accurate and highly-sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community.

Model Compression • Network Pruning • +1

QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

1 code implementation • 7 Jul 2023 • Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel

We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs.

Code Generation

Error Feedback Can Accurately Compress Preconditioners

1 code implementation • 9 Jun 2023 • Ionut-Vlad Modoranu, Aleksei Kalinov, Eldar Kurtic, Elias Frantar, Dan Alistarh

Experiments on deep neural networks show that this approach can compress full-matrix preconditioners to up to 99% sparsity without accuracy loss, effectively removing the memory overhead of full-matrix preconditioners such as GGT and M-FAC.

Classification • Second-order methods
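The error-feedback mechanism the title refers to is simple enough to sketch: compress aggressively, but remember what compression dropped and add it back in before the next compression step. The sketch below applies it to top-k sparsification of an arbitrary vector; the paper's actual target, full-matrix preconditioners such as GGT and M-FAC, is not reproduced here.

```python
# Generic error-feedback loop: carry the compression error forward so that
# information dropped by aggressive compression is not lost permanently.
import numpy as np

def top_k(x, k):
    """Keep only the k largest-magnitude entries of x."""
    mask = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    mask[idx] = 1.0
    return x * mask

error = 0.0
for step in range(100):
    g = np.random.randn(1000)            # stand-in for the quantity being compressed
    corrected = g + error                 # add back what previous steps dropped
    compressed = top_k(corrected, k=10)   # keep ~1% of entries (~99% sparsity)
    error = corrected - compressed        # remember what was dropped this step
    # `compressed` is what would actually be stored or used downstream
```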

Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression

no code implementations • 25 Mar 2023 • Denis Kuznedelev, Soroush Tabesh, Kimia Noorbakhsh, Elias Frantar, Sara Beery, Eldar Kurtic, Dan Alistarh

To address this, we ask: can we quickly compress large generalist models into accurate and efficient specialists?

ZipLM: Inference-Aware Structured Pruning of Language Models

1 code implementation • NeurIPS 2023 • Eldar Kurtic, Elias Frantar, Dan Alistarh

Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications.

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

3 code implementations • 2 Jan 2023 • Elias Frantar, Dan Alistarh

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy.

 Ranked #1 on Language Modelling on WikiText-2 (using extra training data)

Common Sense Reasoning • Language Modelling • +2
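To make "pruned to 50% sparsity in one shot" concrete, the sketch below zeroes half of the weights in each row by magnitude, with no retraining. This is only the naive baseline; SparseGPT itself relies on approximate second-order information and adjusts the surviving weights to compensate, which is what keeps accuracy loss minimal at this scale.

```python
# Naive one-shot baseline: per-row magnitude pruning to a target sparsity.
# SparseGPT's actual procedure is considerably more sophisticated.
import numpy as np

def magnitude_prune_rows(W, sparsity=0.5):
    W = W.copy()
    k = int(W.shape[1] * sparsity)            # weights to drop per row
    for row in W:
        drop = np.argsort(np.abs(row))[:k]    # smallest-magnitude entries
        row[drop] = 0.0
    return W

W = np.random.randn(1024, 1024).astype(np.float32)
W_sparse = magnitude_prune_rows(W, sparsity=0.5)
print((W_sparse == 0).mean())                 # ~0.5
```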

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

11 code implementations • 31 Oct 2022 • Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient.

Language Modelling • Model Compression • +1
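A heavily simplified sketch of the kind of second-order, error-compensated quantization GPTQ builds on (in the spirit of OBQ): quantize one weight column at a time and use the inverse Hessian of the layer inputs so the remaining columns absorb the rounding error. The scale, dampening constant, and shapes below are illustrative; the real implementation uses blocked updates and Cholesky-based numerics for speed and stability.

```python
# Simplified second-order, error-compensated quantization (OBQ/GPTQ-style core idea).
import numpy as np

def quantize_rtn(w, scale):
    return np.clip(np.round(w / scale), -8, 7) * scale    # symmetric 4-bit grid

def gptq_like_quantize(W, X, scale, damp=0.01):
    """W: (rows, cols) weights; X: (samples, cols) calibration inputs."""
    H = X.T @ X / len(X)                                   # proxy Hessian of the inputs
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # dampening for invertibility
    Hinv = np.linalg.inv(H)
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quantize_rtn(W[:, j], scale)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])     # compensate remaining columns
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))                             # toy calibration data
W = rng.normal(size=(32, 64))
Q = gptq_like_quantize(W, X, scale=np.abs(W).max() / 7)
```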

L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning

1 code implementation • 31 Oct 2022 • Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, Dan Alistarh

Data-parallel distributed training of deep neural networks (DNN) has gained very widespread adoption, but can still experience communication bottlenecks.

Image Classification • Language Modelling • +1
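For context, the sketch below shows plain top-k gradient sparsification, the kind of per-layer compression whose level L-GreCo selects adaptively. Only the kept values and their indices would be communicated; the adaptive, layerwise choice of compression ratios, which is the paper's contribution, is not shown.

```python
# Basic top-k gradient sparsification: communicate only the largest entries.
import numpy as np

def compress_topk(grad, ratio):
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest entries
    return idx, flat[idx]                          # what would actually be sent

def decompress(idx, vals, shape):
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = vals
    return flat.reshape(shape)

g = np.random.randn(256, 512)                      # stand-in for one layer's gradient
idx, vals = compress_topk(g, ratio=0.01)           # send ~1% of the gradient
g_hat = decompress(idx, vals, g.shape)
```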

CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models

no code implementations • NeurIPS 2023 • Denis Kuznedelev, Eldar Kurtic, Elias Frantar, Dan Alistarh

To further showcase CAP's accuracy and scalability, we use it to show for the first time that extremely-accurate large vision models, trained via self-supervised techniques, can also be pruned to moderate sparsities, with negligible accuracy loss.

Image Classification • Quantization

Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning

1 code implementation • 24 Aug 2022 • Elias Frantar, Sidak Pal Singh, Dan Alistarh

We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data.

Model Compression • Quantization

SPDY: Accurate Pruning with Speedup Guarantees

1 code implementation • 31 Jan 2022 • Elias Frantar, Dan Alistarh

The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular.

Model Compression

M-FAC: Efficient Matrix-Free Approximations of Second-Order Information

2 code implementations • NeurIPS 2021 • Elias Frantar, Eldar Kurtic, Dan Alistarh

We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian.

Network Pruning • Second-order methods
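The rank-one structure mentioned in the abstract is what makes matrix-free inverse-Hessian-vector products (IHVPs) tractable. The sketch below assumes the Hessian is approximated as lam*I plus the average of m gradient outer products, precomputes m auxiliary vectors via repeated Sherman-Morrison updates in O(d*m^2), and then answers IHVP queries in O(d*m), matching the costs quoted above; the actual M-FAC algorithms are more refined than this sketch.

```python
# Matrix-free IHVPs for H = lam*I + (1/m) * sum_i g_i g_i^T via Sherman-Morrison.
import numpy as np

def precompute(G, lam, m):
    """G: (m, d) gradients. Returns U[k] = A_k^{-1} g_k, where A_k holds the first k terms."""
    U = np.zeros_like(G)
    for k in range(m):
        u = G[k] / lam                                  # apply A_0^{-1} = (1/lam) I
        for j in range(k):                              # fold in earlier rank-one terms
            u -= U[j] * (G[j] @ u) / (m + G[j] @ U[j])
        U[k] = u
    return U

def ihvp(x, G, U, lam, m):
    """Apply the inverse of lam*I + (1/m) * G^T G to x in O(d*m)."""
    y = x / lam
    for k in range(m):
        y -= U[k] * (G[k] @ y) / (m + G[k] @ U[k])
    return y

rng = np.random.default_rng(0)
m, d, lam = 32, 200, 1e-2
G = rng.normal(size=(m, d))
U = precompute(G, lam, m)
x = rng.normal(size=d)
y = ihvp(x, G, U, lam, m)
# Sanity check against the dense inverse on this small example.
H = lam * np.eye(d) + G.T @ G / m
print(np.allclose(y, np.linalg.solve(H, x)))
```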

On the Sample Complexity of Adversarial Multi-Source PAC Learning

no code implementations • ICML 2020 • Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph H. Lampert

We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms.

PAC learning
