no code implementations • ICML 2020 • Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, Dan Alistarh
In this paper, we present an in-depth analysis of methods for maximizing the sparsity of the activations in a trained neural network, and show that, when coupled with an efficient sparse-input convolution algorithm, we can leverage this sparsity for significant performance gains.
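The core computational idea can be sketched as iterating only over the nonzero input activations (e.g., zeros produced by ReLU) and scattering their contributions into the output. The snippet below is a toy illustration of that gather/scatter formulation, not the paper's optimized kernel; shapes and the function name are assumptions.

```python
import numpy as np

def sparse_input_conv2d(x, w):
    """Toy sparse-input convolution (stride 1, no padding): visit only the
    nonzero input activations and scatter their contributions to the output.
    x: (C_in, H, W) activation map, assumed mostly zero after ReLU.
    w: (C_out, C_in, K, K) kernel."""
    c_out, c_in, k, _ = w.shape
    _, h, width = x.shape
    out = np.zeros((c_out, h - k + 1, width - k + 1))
    for c, i, j in zip(*np.nonzero(x)):            # nonzero pixels only
        v = x[c, i, j]
        for di in range(k):
            for dj in range(k):
                oi, oj = i - di, j - dj            # output positions this pixel feeds
                if 0 <= oi < out.shape[1] and 0 <= oj < out.shape[2]:
                    out[:, oi, oj] += v * w[:, c, di, dj]
    return out
```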
no code implementations • 31 Oct 2023 • Rustem Islamov, Mher Safaryan, Dan Alistarh
As a by-product of our analysis, we also demonstrate convergence guarantees for gradient-type algorithms such as SGD with random reshuffling and shuffle-once mini-batch SGD.
1 code implementation • 25 Oct 2023 • Elias Frantar, Dan Alistarh
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts.
1 code implementation • 13 Oct 2023 • Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh
We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy.
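As a point of reference for what casting to 4 bits means, the snippet below shows plain symmetric round-to-nearest INT4 quantization of a tensor; it is a baseline illustration only, not the scheme the paper proposes.

```python
import numpy as np

def quantize_int4_symmetric(t):
    """Symmetric round-to-nearest 4-bit quantization (baseline illustration).
    Returns integer codes in [-7, 7] plus a scale; dequantize with q * scale."""
    qmax = 7
    scale = max(np.abs(t).max(), 1e-12) / qmax
    q = np.clip(np.round(t / scale), -qmax, qmax).astype(np.int8)
    return q, scale
```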
2 code implementations • 10 Oct 2023 • Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh
While the standard approach is to leverage sparsity for computational reduction, we observe that in the case of memory-bound LLMs sparsity can also be leveraged for reducing memory bandwidth.
no code implementations • 6 Oct 2023 • Arshia Soltani Moakhar, Eugenia Iofinova, Dan Alistarh
Towards this goal, multiple tools have been proposed to aid a human examiner in reasoning about a network's behavior in general or on a set of instances.
no code implementations • 15 Sep 2023 • Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains.
no code implementations • 3 Aug 2023 • Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, Dan Alistarh
Obtaining versions of deep neural networks that are both highly accurate and highly sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community.
1 code implementation • 7 Jul 2023 • Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel
We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs.
no code implementations • 14 Jun 2023 • John Lazarsfeld, Dan Alistarh
We study a distributed multi-armed bandit setting among a population of $n$ memory-constrained nodes in the gossip model: at each round, every node locally adopts one of $m$ arms, observes a reward drawn from the arm's (adversarially chosen) distribution, and then communicates with a randomly sampled neighbor, exchanging information to determine its policy in the next round.
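The round structure described above can be written down schematically as follows. The adopt-the-better-arm rule used here is only a placeholder for the paper's memory-constrained policy, and `rewards` (arm to sampled reward) and `neighbors` (adjacency lists) are assumed inputs.

```python
import random

def gossip_round(arms, rewards, neighbors):
    """One schematic round: every node pulls its current arm, observes a
    reward, then exchanges information with one uniformly sampled neighbor.
    The update rule (copy the neighbor's arm if it paid more this round)
    is a placeholder, not the paper's policy."""
    observed = {v: rewards(arms[v]) for v in arms}
    new_arms = dict(arms)
    for v in arms:
        u = random.choice(neighbors[v])
        if observed[u] > observed[v]:
            new_arms[v] = arms[u]
    return new_arms
```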
1 code implementation • 9 Jun 2023 • Ionut-Vlad Modoranu, Aleksei Kalinov, Eldar Kurtic, Dan Alistarh
Extensive experiments on deep neural networks for vision show that this approach can compress full-matrix preconditioners by up to two orders of magnitude without impact on accuracy, effectively removing the memory overhead of full-matrix preconditioning for implementations of full-matrix Adagrad (GGT) and natural gradient (M-FAC).
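A generic top-k compressor with an error-feedback buffer, the standard mechanism for compressing optimizer state without losing information over time, looks roughly like the sketch below; it illustrates the general technique, not the paper's specific preconditioner compressor.

```python
import numpy as np

def topk_with_error_feedback(update, err_buffer, k):
    """Generic error-feedback compression: add back the previously dropped
    residual, keep only the k largest-magnitude entries, and remember what
    was dropped for the next step. Sketch of the mechanism only."""
    corrected = update + err_buffer
    thresh = np.partition(np.abs(corrected).ravel(), -k)[-k]
    compressed = np.where(np.abs(corrected) >= thresh, corrected, 0.0)
    return compressed, corrected - compressed
```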
1 code implementation • 5 Jun 2023 • Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh
Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities.
no code implementations • CVPR 2023 • Eugenia Iofinova, Alexandra Peste, Dan Alistarh
Pruning - that is, setting a significant subset of the parameters of a neural network to zero - is one of the most popular methods of model compression.
no code implementations • 25 Mar 2023 • Denis Kuznedelev, Soroush Tabesh, Kimia Noorbakhsh, Elias Frantar, Sara Beery, Eldar Kurtic, Dan Alistarh
To address this, we ask: can we quickly compress large generalist models into accurate and efficient specialists?
1 code implementation • 9 Feb 2023 • Mahdi Nikdan, Tommaso Pegolotti, Eugenia Iofinova, Eldar Kurtic, Dan Alistarh
We provide a new efficient version of the backpropagation algorithm, specialized to the case where the weights of the neural network being trained are sparse.
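To make the setting concrete, here is a sketch (assuming SciPy's CSR format) of a linear layer's backward pass with a sparse weight matrix: the input gradient only touches nonzero weights, and the weight gradient is only materialized on the existing support. The paper contributes an efficient vectorized version of this idea; the sketch is illustrative only.

```python
import numpy as np
from scipy import sparse

def sparse_linear_backward(w_csr, x, grad_out):
    """Backward pass of y = W x for a CSR weight matrix W (m x n),
    input x (n,) and output gradient grad_out (m,)."""
    grad_x = w_csr.T @ grad_out                    # dL/dx = W^T dL/dy
    rows, cols = w_csr.nonzero()                   # existing sparsity pattern
    grad_w_vals = grad_out[rows] * x[cols]         # dL/dW_ij = dL/dy_i * x_j
    grad_w = sparse.csr_matrix((grad_w_vals, (rows, cols)), shape=w_csr.shape)
    return grad_x, grad_w
```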
no code implementations • 5 Feb 2023 • Ilia Markov, Adrian Vladu, Qi Guo, Dan Alistarh
Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs).
1 code implementation • 2 Jan 2023 • Elias Frantar, Dan Alistarh
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one shot, without any retraining, at minimal loss of accuracy.
Ranked #1 on Language Modelling on WikiText-2 (using extra training data)
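For contrast, the naive one-shot baseline would simply prune each weight matrix by magnitude, as in the sketch below; the paper's method is a second-order, layer-wise reconstruction scheme, which is what makes this sparsity level work at GPT scale, so the snippet is only the trivial point of comparison.

```python
import numpy as np

def magnitude_prune_one_shot(weights, sparsity=0.5):
    """Naive one-shot baseline: zero the smallest-magnitude fraction of each
    weight matrix, with no retraining. Not the paper's algorithm."""
    pruned = {}
    for name, w in weights.items():
        k = int(w.size * sparsity)
        if k == 0:
            pruned[name] = w.copy()
            continue
        thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
        pruned[name] = np.where(np.abs(w) > thresh, w, 0.0)
    return pruned
```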
1 code implementation • 31 Oct 2022 • Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, Dan Alistarh
Data-parallel distributed training of deep neural networks (DNN) has gained very widespread adoption, but can still experience communication bottlenecks.
8 code implementations • 31 Oct 2022 • Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly accurate and highly efficient.
no code implementations • 14 Oct 2022 • Shayan Talaei, Giorgi Nadiradze, Dan Alistarh
Distributed optimization has become one of the standard ways of speeding up machine learning training, and most of the research in the area focuses on distributed first-order, gradient-based methods.
no code implementations • NeurIPS 2023 • Denis Kuznedelev, Eldar Kurtic, Elias Frantar, Dan Alistarh
To further showcase CAP's accuracy and scalability, we use it to show for the first time that extremely accurate large vision models, trained via self-supervised techniques, can also be pruned to moderate sparsities, with negligible accuracy loss.
no code implementations • 12 Oct 2022 • Eldar Kurtic, Dan Alistarh
We revisit the performance of the classic gradual magnitude pruning (GMP) baseline for large language models, focusing on the classic BERT benchmark on various popular tasks.
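Gradual magnitude pruning as studied here typically follows the standard cubic sparsity schedule (Zhu & Gupta style); at each pruning step, the smallest-magnitude weights are zeroed until the scheduled sparsity is reached. A minimal sketch of the schedule, with illustrative argument names, is:

```python
def gmp_sparsity(step, start_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Cubic sparsity ramp commonly used with gradual magnitude pruning:
    sparsity grows from initial to final over [start_step, end_step]."""
    if step < start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3
```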
1 code implementation • 24 Aug 2022 • Elias Frantar, Sidak Pal Singh, Dan Alistarh
We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data.
1 code implementation • 28 Jul 2022 • Alexandra Peste, Adrian Vladu, Eldar Kurtic, Christoph H. Lampert, Dan Alistarh
In this work we propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way, in order to produce models whose local loss behavior is stable under compression operations such as pruning.
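One way to read "stable under compression" is that the gradient is evaluated at a compressed copy of the parameters, as in the schematic step below; the actual CrAM update in the paper differs in its details, and `compress` and `grad_fn` are assumed callables.

```python
def compression_aware_step(params, grad_fn, compress, lr):
    """Schematic compression-aware update: take the gradient at a compressed
    copy of the parameters (e.g. after magnitude pruning) and apply it to the
    dense parameters. A sketch of the idea, not the paper's exact optimizer."""
    grad = grad_fn(compress(params))
    return params - lr * grad
```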
no code implementations • 20 Jun 2022 • Hossein Zakerinia, Shayan Talaei, Giorgi Nadiradze, Dan Alistarh
Federated Learning (FL) enables large-scale distributed training of machine learning models, while still allowing individual nodes to maintain data locally.
2 code implementations • 14 Mar 2022 • Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh
We perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models.
1 code implementation • 13 Mar 2022 • Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh
Our scheme is based on the following algorithmic tools and features: (a) asynchronous local gradient updates on the shared-memory of workers, (b) partial backpropagation, and (c) non-blocking in-place averaging of the local models.
1 code implementation • 31 Jan 2022 • Elias Frantar, Dan Alistarh
The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular.
1 code implementation • CVPR 2022 • Eugenia Iofinova, Alexandra Peste, Mark Kurtz, Dan Alistarh
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" specialized datasets.
1 code implementation • 16 Nov 2021 • Ilia Markov, Hamidreza Ramezanikebrya, Dan Alistarh
CGX is based on two technical advances: at the system level, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly efficient support for compressed communication.
no code implementations • 8 Jul 2021 • Alexandra Peste, Dan Alistarh, Christoph H. Lampert
The availability of large amounts of user-provided data has been key to the success of machine learning for many real-world tasks.
2 code implementations • NeurIPS 2021 • Elias Frantar, Eldar Kurtic, Dan Alistarh
We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian.
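The rank-one structure is what makes this tractable: the standard Sherman-Morrison recursion below (written here for a damped sum of rank-one gradient terms, which is an assumption on the exact setup) updates the inverse one term at a time; the paper's contribution is the precomputation that makes each IHVP and single-element query cheap.

```latex
% Sherman--Morrison recursion for the inverse of a damped sum of rank-one terms.
\[
  \hat{H}_t \;=\; \lambda I + \sum_{i=1}^{t} g_i g_i^{\top},
  \qquad
  \hat{H}_t^{-1} \;=\; \hat{H}_{t-1}^{-1}
    \;-\; \frac{\hat{H}_{t-1}^{-1} g_t \, g_t^{\top} \hat{H}_{t-1}^{-1}}
               {1 + g_t^{\top} \hat{H}_{t-1}^{-1} g_t},
  \qquad \hat{H}_0 = \lambda I .
\]
```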
2 code implementations • NeurIPS 2021 • Alexandra Peste, Eugenia Iofinova, Adrian Vladu, Dan Alistarh
The increasing computational requirements of deep neural networks (DNNs) have led to significant interest in obtaining DNN models that are sparse, yet accurate.
Ranked #1 on Network Pruning on CIFAR-100
no code implementations • 28 Apr 2021 • Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy
As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training.
no code implementations • 17 Feb 2021 • Dan Alistarh, Rati Gelashvili, Joel Rybicki
Let $G$ be a graph on $n$ nodes.
Distributed, Parallel, and Cluster Computing • Data Structures and Algorithms
no code implementations • 31 Jan 2021 • Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste
The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components.
no code implementations • 1 Jan 2021 • Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh
On the theoretical side, we show that this method guarantees ergodic convergence for non-convex objectives, and achieves the classic sublinear rate under standard assumptions.
no code implementations • ICLR 2021 • Zeyuan Allen-Zhu, Faeze Ebrahimian, Jerry Li, Dan Alistarh
We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions.
no code implementations • NeurIPS 2020 • Vitalii Aksenov, Dan Alistarh, Janne H. Korhonen
The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning.
1 code implementation • NeurIPS 2020 • Fartash Faghri, Iman Tabrizian, Ilia Markov, Dan Alistarh, Daniel Roy, Ali Ramezani-Kebrya
Many communication-efficient variants of SGD use gradient quantization schemes.
no code implementations • NeurIPS 2021 • Dan Alistarh, Janne H. Korhonen
We focus on the communication complexity of this problem: our main result provides the first fully unconditional bounds on the total number of bits which need to be sent and received by the $N$ machines to solve this problem under point-to-point communication, within a given error tolerance.
no code implementations • 28 Sep 2020 • Janne H. Korhonen, Dan Alistarh
Motivated by the interest in communication-efficient methods for distributed machine learning, we consider the communication complexity of minimising a sum of $d$-dimensional functions $\sum_{i = 1}^N f_i (x)$, where each function $f_i$ is held by one of the $N$ different machines.
no code implementations • 12 Jun 2020 • Vyacheslav Kungurtsev, Bapi Chatterjee, Dan Alistarh
Stochastic Gradient Langevin Dynamics (SGLD) ensures strong guarantees with regard to convergence in measure for sampling log-concave posterior distributions by adding noise to stochastic gradient iterates.
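For reference, the SGLD iterate referred to above is a stochastic-gradient step plus Gaussian noise scaled to the step size (the notation below is generic, with $\widehat{\nabla} U$ the stochastic gradient of the negative log-posterior):

```latex
\[
  \theta_{t+1} \;=\; \theta_t \;-\; \eta_t \,\widehat{\nabla} U(\theta_t)
  \;+\; \sqrt{2\eta_t}\,\xi_t,
  \qquad \xi_t \sim \mathcal{N}(0, I).
\]
```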
no code implementations • 30 Apr 2020 • Shigang Li, Tal Ben-Nun, Giorgi Nadiradze, Salvatore Di Girolamo, Nikoli Dryden, Dan Alistarh, Torsten Hoefler
For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale.
1 code implementation • NeurIPS 2020 • Sidak Pal Singh, Dan Alistarh
Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems.
1 code implementation • 20 Mar 2020 • Dan Alistarh, Nikita Koval, Giorgi Nadiradze
We show that, for algorithms such as Delaunay mesh triangulation and sorting by insertion, schedulers with a maximum relaxation factor of $k$ in terms of the maximum priority inversion allowed will introduce a maximum amount of wasted work of $O(\log(n)\,\mathrm{poly}(k))$, where $n$ is the number of tasks to be executed.
Data Structures and Algorithms • Distributed, Parallel, and Cluster Computing
no code implementations • 25 Feb 2020 • Vitaly Aksenov, Dan Alistarh, Janne H. Korhonen
The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning.
no code implementations • ICML 2020 • Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph H. Lampert
We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms.
no code implementations • ICLR 2021 • Peter Davies, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, Dan Alistarh
We provide a method of quantization which allows distributed mean estimation to be performed with solution quality dependent only on the distance between inputs, not on input norm, and show an analogous result for distributed variance reduction.
no code implementations • 16 Jan 2020 • Giorgi Nadiradze, Ilia Markov, Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh
Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods used in practice to train large-scale machine learning models.
no code implementations • NeurIPS 2021 • Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Shigang Li, Dan Alistarh
Perhaps surprisingly, we show that a variant of SGD called SwarmSGD still converges in this setting, even if non-blocking communication, quantization, and local steps are all applied in conjunction, and even if the node data distributions and underlying graph topology are both heterogeneous.
no code implementations • 25 Sep 2019 • Vyacheslav Kungurtsev, Malcolm Egan, Bapi Chatterjee, Dan Alistarh
This is all the more surprising since these objectives are the ones appearing in the training of deep neural networks.
no code implementations • 25 Sep 2019 • Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy
As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed on clusters to perform model fitting in parallel.
1 code implementation • NeurIPS 2019 • Chris Wendler, Dan Alistarh, Markus Püschel
We present a novel class of convolutional neural networks (CNNs) for set functions, i.e., data indexed with the powerset of a finite set.
no code implementations • 12 Aug 2019 • Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler
Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself.
no code implementations • 29 Mar 2019 • Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, Furong Huang, Martin Jaggi, Kevin Jamieson, Michael. I. Jordan, Gauri Joshi, Rania Khalaf, Jason Knight, Jakub Konečný, Tim Kraska, Arun Kumar, Anastasios Kyrillidis, Aparna Lakshmiratan, Jing Li, Samuel Madden, H. Brendan McMahan, Erik Meijer, Ioannis Mitliagkas, Rajat Monga, Derek Murray, Kunle Olukotun, Dimitris Papailiopoulos, Gennady Pekhimenko, Theodoros Rekatsinas, Afshin Rostamizadeh, Christopher Ré, Christopher De Sa, Hanie Sedghi, Siddhartha Sen, Virginia Smith, Alex Smola, Dawn Song, Evan Sparks, Ion Stoica, Vivienne Sze, Madeleine Udell, Joaquin Vanschoren, Shivaram Venkataraman, Rashmi Vinayak, Markus Weimer, Andrew Gordon Wilson, Eric Xing, Matei Zaharia, Ce Zhang, Ameet Talwalkar
Machine learning (ML) techniques are enjoying rapidly increasing adoption.
no code implementations • 17 Oct 2018 • Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu
Most of today's distributed machine learning systems assume reliable networks: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message.
no code implementations • NeurIPS 2018 • Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, Cédric Renggli
Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace.
no code implementations • NeurIPS 2018 • Dan Alistarh, Zeyuan Allen-Zhu, Jerry Li
This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially.
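The simplest robust alternative to averaging in this setting is a coordinate-wise median over the $m$ reported gradients, sketched below; the paper's algorithm is more sophisticated and comes with stronger guarantees, so this is only the standard baseline idea.

```python
import numpy as np

def coordinatewise_median(gradients):
    """Aggregate worker gradients with a coordinate-wise median instead of a
    mean, so a minority of Byzantine workers cannot drag any single
    coordinate arbitrarily far. Baseline illustration only."""
    return np.median(np.stack(gradients), axis=0)
```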
no code implementations • 23 Mar 2018 • Dan Alistarh, Christopher De Sa, Nikola Konstantinov
Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks.
no code implementations • 22 Feb 2018 • Cedric Renggli, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, Torsten Hoefler
This allreduce is the single communication bottleneck, and thus the scalability bottleneck, for most machine learning workloads.
5 code implementations • ICLR 2018 • Antonio Polino, Razvan Pascanu, Dan Alistarh
Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning.
no code implementations • 14 Feb 2018 • Nezihe Merve Gürel, Kaan Kara, Alen Stojanov, Tyler Smith, Thomas Lemmin, Dan Alistarh, Markus Püschel, Ce Zhang
Modern scientific instruments produce vast amounts of data, which can overwhelm the processing ability of computer systems.
1 code implementation • 13 Feb 2018 • David Dao, Dan Alistarh, Claudiu Musat, Ce Zhang
We illustrate that trusted computation can enable the creation of an AI market, where each data point has an exact value that should be paid to its creator.
no code implementations • ICML 2017 • Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, Ce Zhang
We examine training at reduced precision, both from a theoretical and practical perspective, and ask: is it possible to train models at end-to-end low precision with provable guarantees?
1 code implementation • 13 Jun 2017 • Dan Alistarh, Justin Kopinsky, Jerry Li, Giorgi Nadiradze
We answer this question, showing that this strategy provides surprisingly strong guarantees: although the single-choice process, where we always insert and remove from a single randomly chosen queue, has degrading cost, going to infinity as we increase the number of steps, in the two-choice process the expected rank of a removed element is $O( n )$ while the expected worst-case cost is $O( n \log n )$.
Data Structures and Algorithms • Distributed, Parallel, and Cluster Computing
1 code implementation • 16 Nov 2016 • Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, Ce Zhang
When applied to linear models together with double sampling, we save up to another 1.7x in data movement compared with uniform quantization.
2 code implementations • NeurIPS 2017 • Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, Milan Vojnovic
In this paper, we propose Quantized SGD (QSGD), a family of compression schemes which allow the compression of gradient updates at each node, while guaranteeing convergence under standard assumptions.
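The flavor of the scheme is unbiased stochastic rounding of normalized gradient coordinates onto a small number of levels; a simplified quantizer in that spirit (not the exact QSGD encoding, which also specifies an efficient bit-level code) is:

```python
import numpy as np

def stochastic_quantize(v, levels=4):
    """Simplified QSGD-style quantizer: randomly round |v_i| / ||v||_2 onto
    `levels` uniform levels so the result is unbiased in expectation."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * levels
    lower = np.floor(scaled)
    round_up = np.random.rand(*v.shape) < (scaled - lower)
    return np.sign(v) * (lower + round_up) * norm / levels
```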
no code implementations • NeurIPS 2015 • Dan Alistarh, Jennifer Iglesias, Milan Vojnovic
In many applications, the data is of rich structure that can be represented by a hypergraph, where the data items are represented by vertices and the associations among items are represented by hyperedges.