Search Results for author: Dan Alistarh

Found 69 papers, 28 papers with code

Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks

no code implementations ICML 2020 Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, Dan Alistarh

In this paper, we present an in-depth analysis of methods for maximizing the sparsity of the activations in a trained neural network, and show that, when coupled with an efficient sparse-input convolution algorithm, we can leverage this sparsity for significant performance gains.

Image Classification

AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms

no code implementations 31 Oct 2023 Rustem Islamov, Mher Safaryan, Dan Alistarh

As a by-product of our analysis, we also demonstrate convergence guarantees for gradient-type algorithms such as SGD with random reshuffling and shuffle-once mini-batch SGD.

QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

1 code implementation 25 Oct 2023 Elias Frantar, Dan Alistarh

Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts.

QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

1 code implementation 13 Oct 2023 Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy.


Sparse Fine-tuning for Inference Acceleration of Large Language Models

2 code implementations 10 Oct 2023 Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh

While the standard approach is to leverage sparsity for computational reduction, we observe that in the case of memory-bound LLMs sparsity can also be leveraged for reducing memory bandwidth.

Quantization Text Generation +1

SPADE: Sparsity-Guided Debugging for Deep Neural Networks

no code implementations 6 Oct 2023 Arshia Soltani Moakhar, Eugenia Iofinova, Dan Alistarh

Towards this goal, multiple tools have been proposed to aid a human examiner in reasoning about a network's behavior in general or on a set of instances.

Learning Theory

Scaling Laws for Sparsely-Connected Foundation Models

no code implementations 15 Sep 2023 Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci

We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains.

Accurate Neural Network Pruning Requires Rethinking Sparse Optimization

no code implementations 3 Aug 2023 Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, Dan Alistarh

Obtaining versions of deep neural networks that are both highly-accurate and highly-sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community.

Model Compression Network Pruning +1

QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

1 code implementation 7 Jul 2023 Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel

We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs.

Code Generation

The Power of Populations in Decentralized Learning Dynamics

no code implementations 14 Jun 2023 John Lazarsfeld, Dan Alistarh

We study a distributed multi-armed bandit setting among a population of $n$ memory-constrained nodes in the gossip model: at each round, every node locally adopts one of $m$ arms, observes a reward drawn from the arm's (adversarially chosen) distribution, and then communicates with a randomly sampled neighbor, exchanging information to determine its policy in the next round.

Error Feedback Can Accurately Compress Preconditioners

1 code implementation 9 Jun 2023 Ionut-Vlad Modoranu, Aleksei Kalinov, Eldar Kurtic, Dan Alistarh

Extensive experiments on deep neural networks for vision show that this approach can compress full-matrix preconditioners by up to two orders of magnitude without impact on accuracy, effectively removing the memory overhead of full-matrix preconditioning for implementations of full-matrix Adagrad (GGT) and natural gradient (M-FAC).

Classification Second-order methods

Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures

no code implementations CVPR 2023 Eugenia Iofinova, Alexandra Peste, Dan Alistarh

Pruning - that is, setting a significant subset of the parameters of a neural network to zero - is one of the most popular methods of model compression.

Model Compression Network Pruning
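
The definition of pruning quoted above (setting a subset of parameters to zero) can be illustrated with a minimal sketch. The magnitude-based selection rule, the function name `magnitude_prune`, and the flat-list weight representation are assumptions made for illustration; they are not taken from the paper.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    `sparsity` is the fraction of entries to set to zero (0.0 to 1.0).
    Magnitude is used as the selection criterion purely for illustration.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered by ascending absolute value; the first ones get pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    print(magnitude_prune([0.5, -0.1, 2.0, 0.05], sparsity=0.5))
```

With a sparsity of 0.5, half of the entries (those closest to zero) are replaced by zeros and the rest are left unchanged.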

Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression

no code implementations 25 Mar 2023 Denis Kuznedelev, Soroush Tabesh, Kimia Noorbakhsh, Elias Frantar, Sara Beery, Eldar Kurtic, Dan Alistarh

To address this, we ask: can we quickly compress large generalist models into accurate and efficient specialists?

SparseProp: Efficient Sparse Backpropagation for Faster Training of Neural Networks

1 code implementation 9 Feb 2023 Mahdi Nikdan, Tommaso Pegolotti, Eugenia Iofinova, Eldar Kurtic, Dan Alistarh

We provide a new efficient version of the backpropagation algorithm, specialized to the case where the weights of the neural network being trained are sparse.

Transfer Learning

Quantized Distributed Training of Large Models with Convergence Guarantees

no code implementations 5 Feb 2023 Ilia Markov, Adrian Vladu, Qi Guo, Dan Alistarh

Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs).


SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

1 code implementation 2 Jan 2023 Elias Frantar, Dan Alistarh

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy.

Ranked #1 on Language Modelling on WikiText-2 (using extra training data)

Common Sense Reasoning Language Modelling +2

L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning

1 code implementation 31 Oct 2022 Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, Dan Alistarh

Data-parallel distributed training of deep neural networks (DNN) has gained very widespread adoption, but can still experience communication bottlenecks.

Image Classification Language Modelling +1

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

8 code implementations 31 Oct 2022 Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient.

Language Modelling Model Compression +1

Hybrid Decentralized Optimization: First- and Zeroth-Order Optimizers Can Be Jointly Leveraged For Faster Convergence

no code implementations 14 Oct 2022 Shayan Talaei, Giorgi Nadiradze, Dan Alistarh

Distributed optimization has become one of the standard ways of speeding up machine learning training, and most of the research in the area focuses on distributed first-order, gradient-based methods.

Distributed Optimization

CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models

no code implementations NeurIPS 2023 Denis Kuznedelev, Eldar Kurtic, Elias Frantar, Dan Alistarh

To further showcase CAP's accuracy and scalability, we use it to show for the first time that extremely-accurate large vision models, trained via self-supervised techniques, can also be pruned to moderate sparsities, with negligible accuracy loss.

Image Classification Quantization

GMP*: Well-Tuned Gradual Magnitude Pruning Can Outperform Most BERT-Pruning Methods

no code implementations 12 Oct 2022 Eldar Kurtic, Dan Alistarh

We revisit the performance of the classic gradual magnitude pruning (GMP) baseline for large language models, focusing on the classic BERT benchmark on various popular tasks.

Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning

1 code implementation 24 Aug 2022 Elias Frantar, Sidak Pal Singh, Dan Alistarh

We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data.

Model Compression Quantization

CrAM: A Compression-Aware Minimizer

1 code implementation 28 Jul 2022 Alexandra Peste, Adrian Vladu, Eldar Kurtic, Christoph H. Lampert, Dan Alistarh

In this work we propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way, in order to produce models whose local loss behavior is stable under compression operations such as pruning.

Image Classification Language Modelling +2

Communication-Efficient Federated Learning With Data and Client Heterogeneity

no code implementations 20 Jun 2022 Hossein Zakerinia, Shayan Talaei, Giorgi Nadiradze, Dan Alistarh

Federated Learning (FL) enables large-scale distributed training of machine learning models, while still allowing individual nodes to maintain data locally.

Federated Learning

Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD

1 code implementation 13 Mar 2022 Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh

Our scheme is based on the following algorithmic tools and features: (a) asynchronous local gradient updates on the shared-memory of workers, (b) partial backpropagation, and (c) non-blocking in-place averaging of the local models.

Blocking Image Classification

SPDY: Accurate Pruning with Speedup Guarantees

1 code implementation 31 Jan 2022 Elias Frantar, Dan Alistarh

The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular.

Model Compression

How Well Do Sparse Imagenet Models Transfer?

1 code implementation CVPR 2022 Eugenia Iofinova, Alexandra Peste, Mark Kurtz, Dan Alistarh

Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" specialized datasets.

Transfer Learning

CGX: Adaptive System Support for Communication-Efficient Deep Learning

1 code implementation 16 Nov 2021 Ilia Markov, Hamidreza Ramezanikebrya, Dan Alistarh

CGX is based on two technical advances: At the system level, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly-efficient support for compressed communication.

SSSE: Efficiently Erasing Samples from Trained Machine Learning Models

no code implementations 8 Jul 2021 Alexandra Peste, Dan Alistarh, Christoph H. Lampert

The availability of large amounts of user-provided data has been key to the success of machine learning for many real-world tasks.

BIG-bench Machine Learning

M-FAC: Efficient Matrix-Free Approximations of Second-Order Information

2 code implementations NeurIPS 2021 Elias Frantar, Eldar Kurtic, Dan Alistarh

We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian.

Network Pruning Second-order methods

AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks

2 code implementations NeurIPS 2021 Alexandra Peste, Eugenia Iofinova, Adrian Vladu, Dan Alistarh

The increasing computational requirements of deep neural networks (DNNs) have led to significant interest in obtaining DNN models that are sparse, yet accurate.

Network Pruning

NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

no code implementations 28 Apr 2021 Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy

As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training.


Fast Graphical Population Protocols

no code implementations 17 Feb 2021 Dan Alistarh, Rati Gelashvili, Joel Rybicki

Let $G$ be a graph on $n$ nodes.

Distributed, Parallel, and Cluster Computing Data Structures and Algorithms

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

no code implementations 31 Jan 2021 Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste

The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components.

Local SGD Meets Asynchrony

no code implementations 1 Jan 2021 Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh

On the theoretical side, we show that this method guarantees ergodic convergence for non-convex objectives, and achieves the classic sublinear rate under standard assumptions.


Byzantine-Resilient Non-Convex Stochastic Gradient Descent

no code implementations ICLR 2021 Zeyuan Allen-Zhu, Faeze Ebrahimian, Jerry Li, Dan Alistarh

We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions.

Distributed Optimization

Scalable Belief Propagation via Relaxed Scheduling

no code implementations NeurIPS 2020 Vitalii Aksenov, Dan Alistarh, Janne H. Korhonen

The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning.

BIG-bench Machine Learning Scheduling

Towards Tight Communication Lower Bounds for Distributed Optimisation

no code implementations NeurIPS 2021 Dan Alistarh, Janne H. Korhonen

We focus on the communication complexity of this problem: our main result provides the first fully unconditional bounds on the total number of bits which need to be sent and received by the $N$ machines to solve this problem under point-to-point communication, within a given error-tolerance.

Improved Communication Lower Bounds for Distributed Optimisation

no code implementations 28 Sep 2020 Janne H. Korhonen, Dan Alistarh

Motivated by the interest in communication-efficient methods for distributed machine learning, we consider the communication complexity of minimising a sum of $d$-dimensional functions $\sum_{i = 1}^N f_i (x)$, where each function $f_i$ is held by one of the $N$ different machines.

Stochastic Gradient Langevin with Delayed Gradients

no code implementations 12 Jun 2020 Vyacheslav Kungurtsev, Bapi Chatterjee, Dan Alistarh

Stochastic Gradient Langevin Dynamics (SGLD) ensures strong guarantees with regards to convergence in measure for sampling log-concave posterior distributions by adding noise to stochastic gradient iterates.

Stochastic Optimization

WoodFisher: Efficient Second-Order Approximation for Neural Network Compression

1 code implementation NeurIPS 2020 Sidak Pal Singh, Dan Alistarh

Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems.

Image Classification Neural Network Compression

Efficiency Guarantees for Parallel Incremental Algorithms under Relaxed Schedulers

1 code implementation 20 Mar 2020 Dan Alistarh, Nikita Koval, Giorgi Nadiradze

We show that, for algorithms such as Delaunay mesh triangulation and sorting by insertion, schedulers with a maximum relaxation factor of $k$ in terms of the maximum priority inversion allowed will introduce a maximum amount of wasted work of $O(\log(n)\,\mathrm{poly}(k))$, where $n$ is the number of tasks to be executed.

Data Structures and Algorithms Distributed, Parallel, and Cluster Computing

Relaxed Scheduling for Scalable Belief Propagation

no code implementations 25 Feb 2020 Vitaly Aksenov, Dan Alistarh, Janne H. Korhonen

The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning.

BIG-bench Machine Learning Scheduling

On the Sample Complexity of Adversarial Multi-Source PAC Learning

no code implementations ICML 2020 Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph H. Lampert

We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms.

PAC learning Test

New Bounds For Distributed Mean Estimation and Variance Reduction

no code implementations ICLR 2021 Peter Davies, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, Dan Alistarh

We provide a method of quantization which allows distributed mean estimation to be performed with solution quality dependent only on the distance between inputs, not on input norm, and show an analogous result for distributed variance reduction.

Distributed Optimization Quantization

Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent

no code implementations 16 Jan 2020 Giorgi Nadiradze, Ilia Markov, Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh

Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods used in practice to train large-scale machine learning models.

BIG-bench Machine Learning

Asynchronous Decentralized SGD with Quantized and Local Updates

no code implementations NeurIPS 2021 Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Shigang Li, Dan Alistarh

Perhaps surprisingly, we show that a variant of SGD called SwarmSGD still converges in this setting, even if non-blocking communication, quantization, and local steps are all applied in conjunction, and even if the node data distributions and underlying graph topology are both heterogeneous.

Blocking Distributed Optimization +2

Asynchronous Stochastic Subgradient Methods for General Nonsmooth Nonconvex Optimization

no code implementations 25 Sep 2019 Vyacheslav Kungurtsev, Malcolm Egan, Bapi Chatterjee, Dan Alistarh

This is all the more surprising since these objectives are the ones appearing in the training of deep neural networks.


Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

no code implementations 25 Sep 2019 Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy

As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed on clusters to perform model fitting in parallel.


Powerset Convolutional Neural Networks

1 code implementation NeurIPS 2019 Chris Wendler, Dan Alistarh, Markus Püschel

We present a novel class of convolutional neural networks (CNNs) for set functions, i.e., data indexed with the powerset of a finite set.

Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

no code implementations 12 Aug 2019 Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler

Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself.

Distributed Learning over Unreliable Networks

no code implementations 17 Oct 2018 Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu

Most of today's distributed machine learning systems assume reliable networks: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message.

BIG-bench Machine Learning

The Convergence of Sparsified Gradient Methods

no code implementations NeurIPS 2018 Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, Cédric Renggli

Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace.


Byzantine Stochastic Gradient Descent

no code implementations NeurIPS 2018 Dan Alistarh, Zeyuan Allen-Zhu, Jerry Li

This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially.

Stochastic Optimization

The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory

no code implementations 23 Mar 2018 Dan Alistarh, Christopher De Sa, Nikola Konstantinov

Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks.

BIG-bench Machine Learning

Model compression via distillation and quantization

5 code implementations ICLR 2018 Antonio Polino, Razvan Pascanu, Dan Alistarh

Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning.

Model Compression Quantization

DataBright: Towards a Global Exchange for Decentralized Data Ownership and Trusted Computation

1 code implementation 13 Feb 2018 David Dao, Dan Alistarh, Claudiu Musat, Ce Zhang

We illustrate that trusted computation can enable the creation of an AI market, where each data point has an exact value that should be paid to its creator.

BIG-bench Machine Learning

ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning

no code implementations ICML 2017 Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, Ce Zhang

We examine training at reduced precision, both from a theoretical and practical perspective, and ask: is it possible to train models at end-to-end low precision with provable guarantees?


The Power of Choice in Priority Scheduling

1 code implementation 13 Jun 2017 Dan Alistarh, Justin Kopinsky, Jerry Li, Giorgi Nadiradze

We answer this question, showing that this strategy provides surprisingly strong guarantees: although the single-choice process, where we always insert and remove from a single randomly chosen queue, has degrading cost, going to infinity as we increase the number of steps, in the two-choice process the expected rank of a removed element is $O(n)$ while the expected worst-case cost is $O(n \log n)$.

Data Structures and Algorithms Distributed, Parallel, and Cluster Computing
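
The snippet above contrasts a single-choice and a two-choice process over several queues. A minimal sequential sketch of one common reading of the two-choice process follows: insert into a random queue, and on removal sample two queues and pop from the one whose top element has the smaller priority. The class name `TwoChoiceQueues`, the use of binary heaps, and the fallback for empty samples are all assumptions for illustration, and the sketch ignores concurrency entirely.

```python
import heapq
import random


class TwoChoiceQueues:
    """Relaxed priority queue built from several ordinary min-heaps."""

    def __init__(self, num_queues, seed=None):
        self._queues = [[] for _ in range(num_queues)]
        self._rng = random.Random(seed)

    def insert(self, priority):
        # Insert into one uniformly random queue.
        heapq.heappush(self._rng.choice(self._queues), priority)

    def remove(self):
        # Sample two queues (possibly the same one) and compare their tops.
        sampled = (self._rng.choice(self._queues), self._rng.choice(self._queues))
        candidates = [q for q in sampled if q]
        if not candidates:
            # Simplification for this sketch: if both samples are empty,
            # fall back to looking at every non-empty queue.
            candidates = [q for q in self._queues if q]
            if not candidates:
                raise IndexError("remove from empty TwoChoiceQueues")
        best = min(candidates, key=lambda q: q[0])
        return heapq.heappop(best)


if __name__ == "__main__":
    queues = TwoChoiceQueues(num_queues=4, seed=1)
    for priority in range(20):
        queues.insert(priority)
    print([queues.remove() for _ in range(20)])
```

The removal order is only approximately sorted: each removed element is the minimum of at most two queues, not of the whole structure, which is the relaxation whose rank cost the abstract bounds.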

The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning

1 code implementation 16 Nov 2016 Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, Ce Zhang

When applied to linear models together with double sampling, we save up to another 1.7x in data movement compared with uniform quantization.


QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

2 code implementations NeurIPS 2017 Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, Milan Vojnovic

In this paper, we propose Quantized SGD (QSGD), a family of compression schemes which allow the compression of gradient updates at each node, while guaranteeing convergence under standard assumptions.

Image Classification Quantization +2
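
As an illustration of gradient compression by quantization, here is a minimal sketch of unbiased stochastic quantization to a fixed number of levels, scaled by the vector's Euclidean norm. It is written in the spirit of the scheme the abstract describes and is not the authors' implementation; the function name `stochastic_quantize`, the `levels` parameter, and the omission of the encoding step are assumptions.

```python
import math
import random


def stochastic_quantize(vector, levels, rng=None):
    """Randomly round each coordinate to one of `levels` + 1 grid points.

    Each |x| / norm is mapped onto the grid {0, 1/levels, ..., 1} and rounded
    up with probability equal to its fractional position between neighbours,
    so the result equals the input in expectation.
    """
    rng = rng or random.Random()
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return [0.0 for _ in vector]
    quantized = []
    for x in vector:
        scaled = abs(x) / norm * levels  # position on the grid, in [0, levels]
        lower = math.floor(scaled)
        round_up = rng.random() < (scaled - lower)
        level = min(lower + (1 if round_up else 0), levels)
        quantized.append(math.copysign(norm * level / levels, x))
    return quantized


if __name__ == "__main__":
    print(stochastic_quantize([3.0, -4.0, 0.0], levels=4, rng=random.Random(0)))
```

Only the norm, the signs, and the small integer levels would need to be transmitted, which is where the communication saving comes from; fewer levels mean fewer bits but higher variance.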

Streaming Min-max Hypergraph Partitioning

no code implementations NeurIPS 2015 Dan Alistarh, Jennifer Iglesias, Milan Vojnovic

In many applications, the data is of rich structure that can be represented by a hypergraph, where the data items are represented by vertices and the associations among items are represented by hyperedges.

Clustering hypergraph partitioning
