Search Results for author: Elias Frantar

Found 21 papers, 16 papers with code

Extreme Compression of Large Language Models via Additive Quantization

1 code implementation • 11 Jan 2024 • Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh

The emergence of accurate open large language models (LLMs) has led to a race towards quantization techniques for such models enabling execution on end-user devices.

Quantization
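As a rough illustration of the additive-quantization idea named in the title (representing a group of weights as a sum of codewords drawn from several codebooks), here is a minimal greedy residual-quantization sketch. The codebooks are random and untrained, and the greedy search is far simpler than the learned codebooks and search procedure used in the actual method.

```python
# Minimal sketch of additive (residual) quantization: a weight group is
# approximated by a SUM of codewords, one per codebook. Illustrative only;
# the paper's method learns its codebooks and uses more sophisticated search.
import numpy as np

def greedy_additive_encode(x, codebooks):
    """Pick one codeword per codebook that best matches the running residual."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                      # cb: (num_codewords, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def additive_decode(codes, codebooks):
    """Reconstruct the group as the sum of the selected codewords."""
    return sum(cb[i] for i, cb in zip(codes, codebooks))

# Toy usage with random (untrained) codebooks.
rng = np.random.default_rng(0)
dim, n_books, n_words = 8, 2, 256
codebooks = [rng.normal(size=(n_words, dim)) for _ in range(n_books)]
x = rng.normal(size=dim)
codes = greedy_additive_encode(x, codebooks)
x_hat = additive_decode(codes, codebooks)
print(codes, np.linalg.norm(x - x_hat))
```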

QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

1 code implementation • 25 Oct 2023 • Elias Frantar, Dan Alistarh

Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts.
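For readers unfamiliar with MoE layers, the sketch below illustrates the sparse (top-k) routing mentioned above: only a few experts run per token, which is why parameter counts can grow far faster than per-token compute. Shapes and names are made up for illustration; this is not QMoE's compression scheme.

```python
# Minimal sketch of sparse (top-k) MoE routing: each token is processed by only
# the k best-scoring experts, weighted by a softmax over their gate scores.
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """x: (dim,) token; gate_w: (num_experts, dim); experts: list of weight matrices."""
    logits = gate_w @ x
    top = np.argsort(logits)[-k:]                 # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                      # softmax over the selected experts
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

rng = np.random.default_rng(1)
dim, n_experts = 16, 8
experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
gate_w = rng.normal(size=(n_experts, dim))
y = moe_forward(rng.normal(size=dim), gate_w, experts)
```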

QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

1 code implementation • 13 Oct 2023 • Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations cast to 4 bits, in a way that yields practical speedups while maintaining good accuracy.

Computational Efficiency • Quantization
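As a point of reference for what "casting to 4 bits" means, here is a minimal symmetric 4-bit quantize/dequantize round trip. The end-to-end method described in the paper involves much more than this (efficient kernels and careful treatment of the parts of the model that remain in higher precision), none of which is reproduced here.

```python
# Minimal symmetric 4-bit quantize/dequantize round trip, for illustration only.
import numpy as np

def quantize_sym_4bit(x):
    # Use the symmetric integer range [-7, 7] with a single per-tensor scale.
    scale = np.max(np.abs(x)) / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_sym_4bit(x)
print(np.abs(x - dequantize(q, s)).max())   # worst-case rounding error
```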

Sparse Fine-tuning for Inference Acceleration of Large Language Models

2 code implementations • 10 Oct 2023 • Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh

While the standard approach is to leverage sparsity for computational reduction, we observe that in the case of memory-bound LLMs sparsity can also be leveraged for reducing memory bandwidth.

Quantization • Text Generation • +1

Scaling Laws for Sparsely-Connected Foundation Models

no code implementations • 15 Sep 2023 • Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci

We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains.

Computational Efficiency

Accurate Neural Network Pruning Requires Rethinking Sparse Optimization

no code implementations • 3 Aug 2023 • Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, Dan Alistarh

Obtaining versions of deep neural networks that are both highly-accurate and highly-sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community.

Model Compression • Network Pruning • +1

QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

1 code implementation • 7 Jul 2023 • Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel

We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs.

Code Generation

Error Feedback Can Accurately Compress Preconditioners

1 code implementation • 9 Jun 2023 • Ionut-Vlad Modoranu, Aleksei Kalinov, Eldar Kurtic, Elias Frantar, Dan Alistarh

Experiments on deep neural networks show that this approach can compress full-matrix preconditioners to up to 99% sparsity without accuracy loss, effectively removing the memory overhead of full-matrix preconditioners such as GGT and M-FAC.

Classification • Second-order methods
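The error-feedback mechanism the title refers to is simple enough to sketch: compress aggressively, but remember what compression dropped and add it back in before the next compression step. The sketch below applies it to top-k sparsification of an arbitrary vector; the paper's actual target, full-matrix preconditioners such as GGT and M-FAC, is not reproduced here.

```python
# Generic error-feedback loop: carry the compression error forward so that
# information dropped by aggressive compression is not lost permanently.
import numpy as np

def top_k(x, k):
    """Keep only the k largest-magnitude entries of x."""
    mask = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    mask[idx] = 1.0
    return x * mask

error = 0.0
for step in range(100):
    g = np.random.randn(1000)            # stand-in for the quantity being compressed
    corrected = g + error                 # add back what previous steps dropped
    compressed = top_k(corrected, k=10)   # keep ~1% of entries (~99% sparsity)
    error = corrected - compressed        # remember what was dropped this step
    # `compressed` is what would actually be stored or used downstream
```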

Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression

no code implementations • 25 Mar 2023 • Denis Kuznedelev, Soroush Tabesh, Kimia Noorbakhsh, Elias Frantar, Sara Beery, Eldar Kurtic, Dan Alistarh

To address this, we ask: can we quickly compress large generalist models into accurate and efficient specialists?

ZipLM: Inference-Aware Structured Pruning of Language Models

1 code implementation • NeurIPS 2023 • Eldar Kurtic, Elias Frantar, Dan Alistarh

Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications.

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

3 code implementations • 2 Jan 2023 • Elias Frantar, Dan Alistarh

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy.

 Ranked #1 on Language Modelling on WikiText-2 (using extra training data)

Common Sense Reasoning • Language Modelling • +2
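To make "pruned to 50% sparsity in one shot" concrete, the sketch below zeroes half of the weights in each row by magnitude, with no retraining. This is only the naive baseline; SparseGPT itself relies on approximate second-order information and adjusts the surviving weights to compensate, which is what keeps accuracy loss minimal at this scale.

```python
# Naive one-shot baseline: per-row magnitude pruning to a target sparsity.
# SparseGPT's actual procedure is considerably more sophisticated.
import numpy as np

def magnitude_prune_rows(W, sparsity=0.5):
    W = W.copy()
    k = int(W.shape[1] * sparsity)            # weights to drop per row
    for row in W:
        drop = np.argsort(np.abs(row))[:k]    # smallest-magnitude entries
        row[drop] = 0.0
    return W

W = np.random.randn(1024, 1024).astype(np.float32)
W_sparse = magnitude_prune_rows(W, sparsity=0.5)
print((W_sparse == 0).mean())                 # ~0.5
```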

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

11 code implementations • 31 Oct 2022 • Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient.

Language Modelling • Model Compression • +1
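A heavily simplified sketch of the kind of second-order, error-compensated quantization GPTQ builds on (in the spirit of OBQ): quantize one weight column at a time and use the inverse Hessian of the layer inputs so the remaining columns absorb the rounding error. The scale, dampening constant, and shapes below are illustrative; the real implementation uses blocked updates and Cholesky-based numerics for speed and stability.

```python
# Simplified second-order, error-compensated quantization (OBQ/GPTQ-style core idea).
import numpy as np

def quantize_rtn(w, scale):
    return np.clip(np.round(w / scale), -8, 7) * scale    # symmetric 4-bit grid

def gptq_like_quantize(W, X, scale, damp=0.01):
    """W: (rows, cols) weights; X: (samples, cols) calibration inputs."""
    H = X.T @ X / len(X)                                   # proxy Hessian of the inputs
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # dampening for invertibility
    Hinv = np.linalg.inv(H)
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quantize_rtn(W[:, j], scale)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])     # compensate remaining columns
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))                             # toy calibration data
W = rng.normal(size=(32, 64))
Q = gptq_like_quantize(W, X, scale=np.abs(W).max() / 7)
```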

L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning

1 code implementation • 31 Oct 2022 • Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, Dan Alistarh

Data-parallel distributed training of deep neural networks (DNN) has gained very widespread adoption, but can still experience communication bottlenecks.

Image Classification • Language Modelling • +1
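For context, the sketch below shows plain top-k gradient sparsification, the kind of per-layer compression whose level L-GreCo selects adaptively. Only the kept values and their indices would be communicated; the adaptive, layerwise choice of compression ratios, which is the paper's contribution, is not shown.

```python
# Basic top-k gradient sparsification: communicate only the largest entries.
import numpy as np

def compress_topk(grad, ratio):
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest entries
    return idx, flat[idx]                          # what would actually be sent

def decompress(idx, vals, shape):
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = vals
    return flat.reshape(shape)

g = np.random.randn(256, 512)                      # stand-in for one layer's gradient
idx, vals = compress_topk(g, ratio=0.01)           # send ~1% of the gradient
g_hat = decompress(idx, vals, g.shape)
```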

CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models

no code implementations • NeurIPS 2023 • Denis Kuznedelev, Eldar Kurtic, Elias Frantar, Dan Alistarh

To further showcase CAP's accuracy and scalability, we use it to show for the first time that extremely-accurate large vision models, trained via self-supervised techniques, can also be pruned to moderate sparsities, with negligible accuracy loss.

Image Classification • Quantization

Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning

1 code implementation • 24 Aug 2022 • Elias Frantar, Sidak Pal Singh, Dan Alistarh

We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data.

Model Compression • Quantization

SPDY: Accurate Pruning with Speedup Guarantees

1 code implementation • 31 Jan 2022 • Elias Frantar, Dan Alistarh

The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular.

Model Compression

M-FAC: Efficient Matrix-Free Approximations of Second-Order Information

2 code implementations • NeurIPS 2021 • Elias Frantar, Eldar Kurtic, Dan Alistarh

We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian.

Network Pruning • Second-order methods
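The rank-one structure mentioned in the abstract is what makes matrix-free inverse-Hessian-vector products (IHVPs) tractable. The sketch below assumes the Hessian is approximated as lam*I plus the average of m gradient outer products, precomputes m auxiliary vectors via repeated Sherman-Morrison updates in O(d*m^2), and then answers IHVP queries in O(d*m), matching the costs quoted above; the actual M-FAC algorithms are more refined than this sketch.

```python
# Matrix-free IHVPs for H = lam*I + (1/m) * sum_i g_i g_i^T via Sherman-Morrison.
import numpy as np

def precompute(G, lam, m):
    """G: (m, d) gradients. Returns U[k] = A_k^{-1} g_k, where A_k holds the first k terms."""
    U = np.zeros_like(G)
    for k in range(m):
        u = G[k] / lam                                  # apply A_0^{-1} = (1/lam) I
        for j in range(k):                              # fold in earlier rank-one terms
            u -= U[j] * (G[j] @ u) / (m + G[j] @ U[j])
        U[k] = u
    return U

def ihvp(x, G, U, lam, m):
    """Apply the inverse of lam*I + (1/m) * G^T G to x in O(d*m)."""
    y = x / lam
    for k in range(m):
        y -= U[k] * (G[k] @ y) / (m + G[k] @ U[k])
    return y

rng = np.random.default_rng(0)
m, d, lam = 32, 200, 1e-2
G = rng.normal(size=(m, d))
U = precompute(G, lam, m)
x = rng.normal(size=d)
y = ihvp(x, G, U, lam, m)
# Sanity check against the dense inverse on this small example.
H = lam * np.eye(d) + G.T @ G / m
print(np.allclose(y, np.linalg.solve(H, x)))
```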

On the Sample Complexity of Adversarial Multi-Source PAC Learning

no code implementations • ICML 2020 • Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph H. Lampert

We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms.

PAC learning
