no code implementations • ICML 2020 • Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, Dan Alistarh
In this paper, we present an in-depth analysis of methods for maximizing the sparsity of the activations in a trained neural network, and show that, when coupled with an efficient sparse-input convolution algorithm, we can leverage this sparsity for significant performance gains.
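As a minimal illustration of the idea (a naive numpy sketch, not the paper's optimized kernel), a 1D convolution can skip zero activations entirely, so post-ReLU sparsity directly reduces work:

```python
import numpy as np

def sparse_input_conv1d(x, w):
    # Naive "valid" 1D convolution that visits only the nonzero entries of x,
    # illustrating how activation sparsity can be exploited for speed.
    out = np.zeros(len(x) - len(w) + 1)
    for i in np.flatnonzero(x):                 # skip zero activations entirely
        for j in range(max(0, i - len(w) + 1), min(len(out) - 1, i) + 1):
            out[j] += x[i] * w[i - j]
    return out

# e.g. x = np.maximum(np.random.randn(10_000), 0)   # ~50% zeros after ReLU
#      w = np.random.randn(16); y = sparse_input_conv1d(x, w)
```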
no code implementations • 5 Jan 2025 • Saleh Ashkboos, Mahdi Nikdan, Soroush Tabesh, Roberto L. Castro, Torsten Hoefler, Dan Alistarh
We present HALO, a novel quantization-aware training approach for Transformers that enables accurate and efficient low-precision training by combining 1) strategic placement of Hadamard rotations in both forward and backward passes, to mitigate outliers during the low-precision computation, 2) FSDP integration for low-precision communication, and 3) high-performance kernel support.
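A hedged illustration of why Hadamard rotations help low-precision computation (not the HALO training procedure itself): rotating by an orthogonal Hadamard matrix spreads an outlier across all coordinates, so a low-precision format loses much less information when the rotation is undone.

```python
import numpy as np
from scipy.linalg import hadamard

def int8_roundtrip(x):
    # Symmetric per-tensor int8 quantize/dequantize.
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).clip(-127, 127) * scale

d = 1024                                   # power of two, so a Hadamard matrix exists
H = hadamard(d) / np.sqrt(d)               # orthogonal rotation
x = np.random.randn(d)
x[0] = 100.0                               # a single outlier dominates the quantization scale

err_plain   = np.linalg.norm(x - int8_roundtrip(x))
err_rotated = np.linalg.norm(x - H.T @ int8_roundtrip(H @ x))   # rotate, quantize, rotate back
print(err_plain, err_rotated)              # the rotated path loses far less precision
```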
1 code implementation • 26 Nov 2024 • Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, Dan Alistarh
Quantizing large language models has become a standard way to reduce their memory and computational costs.
no code implementations • 4 Nov 2024 • Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh
We find that W4A16 offers the best cost-efficiency for synchronous deployments, as well as for asynchronous deployment on mid-tier GPUs.
1 code implementation • 21 Oct 2024 • Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh
We introduce LDAdam, a memory-efficient optimizer for training large models that performs adaptive optimization steps within lower-dimensional subspaces, while consistently exploring the full parameter space during training.
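A rough sketch of the underlying idea, assuming a fixed random projection and plain numpy (the class name and hyperparameters are illustrative; the actual LDAdam optimizer adapts its subspace and corrects for projection error):

```python
import numpy as np

class LowDimAdamSketch:
    """Toy Adam whose moment estimates live in a k-dimensional subspace (illustration only)."""
    def __init__(self, d, k, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        self.P = np.random.randn(k, d) / np.sqrt(k)   # fixed random projection (assumption)
        self.m = np.zeros(k)                          # first moment, stored in the subspace
        self.v = np.zeros(k)                          # second moment, stored in the subspace
        self.t = 0
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps

    def step(self, params, grad):
        self.t += 1
        g = self.P @ grad                              # project the gradient into the subspace
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        update = self.P.T @ (m_hat / (np.sqrt(v_hat) + self.eps))   # lift back to full space
        return params - self.lr * update
```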
1 code implementation • 18 Oct 2024 • Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh
Yet, current methods rely on heuristics for identifying the "importance" of a given layer towards the loss, based on assumptions such as \emph{error monotonicity}, i.e., that the end-to-end model compression error is proportional to the sum of layer-wise errors.
no code implementations • 8 Oct 2024 • Jiale Chen, Dingling Yao, Adeel Pervez, Dan Alistarh, Francesco Locatello
We propose Scalable Mechanistic Neural Network (S-MNN), an enhanced neural network framework designed for scientific machine learning applications involving long temporal sequences.
no code implementations • 31 Aug 2024 • Vage Egiazarian, Denis Kuznedelev, Anton Voronov, Ruslan Svirschevski, Michael Goin, Daniil Pavlov, Dan Alistarh, Dmitry Baranchuk
Specifically, we tailor vector-based PTQ methods to recent billion-scale text-to-image models (SDXL and SDXL-Turbo), and show that diffusion models with 2B+ parameters compressed to around 3 bits using VQ exhibit similar image quality and textual alignment to previous 4-bit compression techniques.
no code implementations • 30 Aug 2024 • Diyuan Wu, Ionut-Vlad Modoranu, Mher Safaryan, Denis Kuznedelev, Dan Alistarh
The rising footprint of machine learning has led to a focus on imposing \emph{model sparsity} as a means of reducing computational and memory costs.
2 code implementations • 21 Aug 2024 • Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, Dan Alistarh
As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, weight quantization has become a standard technique for efficient GPU deployment.
1 code implementation • 24 Jun 2024 • Armand Nicolicioiu, Eugenia Iofinova, Eldar Kurtic, Mahdi Nikdan, Andrei Panferov, Ilia Markov, Nir Shavit, Dan Alistarh
Specifically, Panza can be both trained and run for inference locally on commodity hardware, and is personalized to the user's writing style.
1 code implementation • 18 Jun 2024 • Eldar Kurtic, Amir Moeini, Dan Alistarh
We introduce Mathador-LM, a new benchmark for evaluating the mathematical reasoning of large language models (LLMs), combining ruleset interpretation, planning, and problem-solving.
1 code implementation • 24 May 2024 • Shashata Sawmya, Linghao Kong, Ilia Markov, Dan Alistarh, Nir Shavit
We show how to improve the inference efficiency of an LLM by expanding it into a mixture of sparse experts, where each expert is a copy of the original weights, one-shot pruned for a specific cluster of input values.
1 code implementation • 24 May 2024 • Ionut-Vlad Modoranu, Mher Safaryan, Grigory Malinovsky, Eldar Kurtic, Thomas Robert, Peter Richtarik, Dan Alistarh
We propose a new variant of the Adam optimizer called MicroAdam that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees.
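The following is a rough sketch of the general recipe behind memory-frugal adaptive optimizers: compress each gradient (here with a hypothetical top-k operator) and carry the compression residual forward via error feedback. It is an illustration only, not the actual MicroAdam algorithm; in particular, the dense buffers below are kept for clarity, whereas a memory-efficient implementation would store compressed quantities.

```python
import numpy as np

def topk_compress(g, k):
    # Keep the k largest-magnitude coordinates; return the compressed grad and the residual.
    idx = np.argpartition(np.abs(g), -k)[-k:]
    c = np.zeros_like(g)
    c[idx] = g[idx]
    return c, g - c

def compressed_adam_step(p, g, state, k, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    g_fb = g + state["error"]                      # error feedback: re-inject the past residual
    c, state["error"] = topk_compress(g_fb, k)     # only the compressed part drives the update
    state["m"] = b1 * state["m"] + (1 - b1) * c
    state["v"] = b2 * state["v"] + (1 - b2) * c * c
    state["t"] += 1
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return p - lr * m_hat / (np.sqrt(v_hat) + eps)
```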
1 code implementation • 23 May 2024 • Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik
In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs.
no code implementations • 6 May 2024 • Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz
We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset.
1 code implementation • 4 Apr 2024 • Aniruddha Nrusimha, Mayank Mishra, Naigang Wang, Dan Alistarh, Rameswar Panda, Yoon Kim
We show that regularizing both the inputs and outputs is crucial for preventing a model from "migrating" the difficulty of input quantization to the weights, which makes post-training quantization (PTQ) of weights more difficult.
3 code implementations • 30 Mar 2024 • Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman
We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits.
1 code implementation • 11 Jan 2024 • Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh
The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices.
2 code implementations • 9 Jan 2024 • Mahdi Nikdan, Soroush Tabesh, Elvir Crnčević, Dan Alistarh
We investigate parameter-efficient fine-tuning (PEFT) methods that can provide good accuracy under limited computational and memory budgets in the context of large language models (LLMs).
no code implementations • 21 Dec 2023 • Eldar Kurtic, Torsten Hoefler, Dan Alistarh
Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task.
no code implementations • 11 Dec 2023 • Paniz Halvachi, Alexandra Peste, Dan Alistarh, Christoph H. Lampert
We present ELSA, a practical solution for creating deep networks that can easily be deployed at different levels of sparsity.
no code implementations • 31 Oct 2023 • Rustem Islamov, Mher Safaryan, Dan Alistarh
As a by-product of our analysis, we also demonstrate convergence guarantees for gradient-type algorithms such as SGD with random reshuffling and shuffle-once mini-batch SGD.
1 code implementation • 25 Oct 2023 • Elias Frantar, Dan Alistarh
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts.
1 code implementation • 13 Oct 2023 • Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh
We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy.
2 code implementations • 10 Oct 2023 • Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh
While the standard approach is to leverage sparsity for computational reduction, we observe that in the case of memory-bound LLMs sparsity can also be leveraged for reducing memory bandwidth.
1 code implementation • 6 Oct 2023 • Arshia Soltani Moakhar, Eugenia Iofinova, Elias Frantar, Dan Alistarh
In this paper, we demonstrate, for the first time, that sparsity can instead be incorporated into the interpretation process itself, as a sample-specific preprocessing step.
no code implementations • 15 Sep 2023 • Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains.
no code implementations • 3 Aug 2023 • Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, Dan Alistarh
Obtaining versions of deep neural networks that are both highly-accurate and highly-sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community.
1 code implementation • 7 Jul 2023 • Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel
We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs.
no code implementations • 14 Jun 2023 • John Lazarsfeld, Dan Alistarh
We study a cooperative multi-agent bandit setting in the distributed GOSSIP model: in every round, each of $n$ agents chooses an action from a common set, observes the action's corresponding reward, and subsequently exchanges information with a single randomly chosen neighbor, which may inform its choice in the next round.
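A toy simulation of this communication pattern, assuming hypothetical ε-greedy agents and parameters (it illustrates the GOSSIP-style pairwise exchange, not the paper's algorithm or guarantees):

```python
import numpy as np

rng = np.random.default_rng(0)
n, actions, rounds, eps = 50, 5, 2000, 0.1
true_means = rng.uniform(size=actions)
est = np.zeros((n, actions))              # each agent's per-action reward estimates
counts = np.ones((n, actions))

for _ in range(rounds):
    # every agent picks an action (epsilon-greedy) and observes a noisy reward
    greedy = est.argmax(axis=1)
    explore = rng.random(n) < eps
    acts = np.where(explore, rng.integers(actions, size=n), greedy)
    rewards = rng.normal(true_means[acts], 0.1)
    agents = np.arange(n)
    est[agents, acts] += (rewards - est[agents, acts]) / counts[agents, acts]
    counts[agents, acts] += 1
    # gossip step: each agent averages its estimates with one uniformly random peer
    for i in range(n):
        j = rng.integers(n)
        avg = (est[i] + est[j]) / 2
        est[i], est[j] = avg, avg.copy()

print((est.argmax(axis=1) == true_means.argmax()).mean())  # fraction of agents on the best arm
```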
1 code implementation • 9 Jun 2023 • Ionut-Vlad Modoranu, Aleksei Kalinov, Eldar Kurtic, Elias Frantar, Dan Alistarh
Experiments on deep neural networks show that this approach can compress full-matrix preconditioners to up to 99% sparsity without accuracy loss, effectively removing the memory overhead of full-matrix preconditioners such as GGT and M-FAC.
1 code implementation • 5 Jun 2023 • Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh
Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities.
1 code implementation • NeurIPS 2023 • Mher Safaryan, Alexandra Peste, Dan Alistarh
We show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism.
no code implementations • CVPR 2023 • Eugenia Iofinova, Alexandra Peste, Dan Alistarh
Pruning - that is, setting a significant subset of the parameters of a neural network to zero - is one of the most popular methods of model compression.
no code implementations • 25 Mar 2023 • Denis Kuznedelev, Soroush Tabesh, Kimia Noorbakhsh, Elias Frantar, Sara Beery, Eldar Kurtic, Dan Alistarh
To address this, we ask: can we quickly compress large generalist models into accurate and efficient specialists?
1 code implementation • 9 Feb 2023 • Mahdi Nikdan, Tommaso Pegolotti, Eugenia Iofinova, Eldar Kurtic, Dan Alistarh
We provide a new efficient version of the backpropagation algorithm, specialized to the case where the weights of the neural network being trained are sparse.
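A minimal sketch of the core computation for a single linear layer, assuming scipy.sparse CSR weights (the paper's algorithm covers full networks and is heavily optimized; the function names below are illustrative):

```python
import numpy as np
from scipy import sparse

def sparse_linear_forward(W, x):
    # W: (out_dim, in_dim) CSR sparse weights, x: (batch, in_dim) dense activations.
    return (W @ x.T).T                   # sparse-dense matmul touches nonzero weights only

def sparse_linear_backward(W, x, grad_out):
    grad_x = (W.T @ grad_out.T).T        # (batch, in_dim): only nonzero weights contribute
    rows, cols = W.nonzero()
    # the weight gradient is only needed at W's nonzero positions (a "masked" outer product)
    grad_w_vals = np.sum(grad_out[:, rows] * x[:, cols], axis=0)
    return grad_x, sparse.csr_matrix((grad_w_vals, (rows, cols)), shape=W.shape)

# e.g. W = sparse.random(256, 512, density=0.05, format="csr")
#      x = np.random.randn(32, 512); g = np.random.randn(32, 256)
#      y = sparse_linear_forward(W, x); gx, gW = sparse_linear_backward(W, x, g)
```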
1 code implementation • NeurIPS 2023 • Eldar Kurtic, Elias Frantar, Dan Alistarh
Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications.
no code implementations • 5 Feb 2023 • Ilia Markov, Adrian Vladu, Qi Guo, Dan Alistarh
Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs).
5 code implementations • 2 Jan 2023 • Elias Frantar, Dan Alistarh
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy.
Ranked #1 on Language Modelling on WikiText-2 (using extra training data)
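For context, the simplest one-shot baseline is plain magnitude pruning, sketched below; SparseGPT itself goes well beyond this, using approximate second-order information and weight updates to retain accuracy at high sparsity.

```python
import numpy as np

def magnitude_prune(W, sparsity=0.5):
    # One-shot magnitude pruning baseline: zero out the smallest-magnitude weights.
    k = int(W.size * sparsity)
    threshold = np.partition(np.abs(W), k, axis=None)[k]
    return np.where(np.abs(W) >= threshold, W, 0.0)
```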
16 code implementations • 31 Oct 2022 • Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient.
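For reference, the round-to-nearest (RTN) baseline that GPTQ improves upon can be sketched in a few lines; GPTQ itself quantizes weights sequentially and compensates each rounding error using approximate second-order information.

```python
import numpy as np

def rtn_quantize(W, bits=4):
    # Per-output-channel asymmetric round-to-nearest quantization of a weight matrix.
    qmax = 2 ** bits - 1
    lo, hi = W.min(axis=1, keepdims=True), W.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / qmax, 1e-8)
    zero = np.round(-lo / scale)
    q = np.clip(np.round(W / scale) + zero, 0, qmax)
    return (q - zero) * scale            # dequantized weights, for inspecting the error
```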
1 code implementation • 31 Oct 2022 • Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, Dan Alistarh
Data-parallel distributed training of deep neural networks (DNN) has gained very widespread adoption, but can still experience communication bottlenecks.
no code implementations • NeurIPS 2023 • Denis Kuznedelev, Eldar Kurtic, Elias Frantar, Dan Alistarh
To further showcase CAP's accuracy and scalability, we use it to show for the first time that extremely-accurate large vision models, trained via self-supervised techniques, can also be pruned to moderate sparsities, with negligible accuracy loss.
no code implementations • 14 Oct 2022 • Matin Ansaripour, Shayan Talaei, Giorgi Nadiradze, Dan Alistarh
Distributed optimization is the standard way of speeding up machine learning training, and most of the research in the area focuses on distributed first-order, gradient-based methods.
no code implementations • 12 Oct 2022 • Eldar Kurtic, Dan Alistarh
We revisit the performance of the classic gradual magnitude pruning (GMP) baseline for large language models, focusing on the classic BERT benchmark on various popular tasks.
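A compact sketch of the cubic sparsity schedule commonly used with GMP (the hyperparameters below are illustrative, not the paper's settings):

```python
def gmp_sparsity(step, start_step, end_step, final_sparsity, initial_sparsity=0.0):
    # Cubic gradual-magnitude-pruning schedule: sparsity ramps from initial to final
    # between start_step and end_step, then stays constant.
    if step < start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - progress) ** 3

# e.g. ramp to 90% sparsity between steps 2_000 and 20_000, re-thresholding weights
# by magnitude every few hundred steps at the scheduled sparsity level.
```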
1 code implementation • 24 Aug 2022 • Elias Frantar, Sidak Pal Singh, Dan Alistarh
We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data.
1 code implementation • 28 Jul 2022 • Alexandra Peste, Adrian Vladu, Eldar Kurtic, Christoph H. Lampert, Dan Alistarh
In this work we propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way, in order to produce models whose local loss behavior is stable under compression operations such as pruning.
no code implementations • 20 Jun 2022 • Hossein Zakerinia, Shayan Talaei, Giorgi Nadiradze, Dan Alistarh
Federated Learning (FL) enables large-scale distributed training of machine learning models, while still allowing individual nodes to maintain data locally.
2 code implementations • 14 Mar 2022 • Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh
We perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models.
1 code implementation • 13 Mar 2022 • Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh
Our scheme is based on the following algorithmic tools and features: (a) asynchronous local gradient updates on the shared-memory of workers, (b) partial backpropagation, and (c) non-blocking in-place averaging of the local models.
1 code implementation • 31 Jan 2022 • Elias Frantar, Dan Alistarh
The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular.
1 code implementation • CVPR 2022 • Eugenia Iofinova, Alexandra Peste, Mark Kurtz, Dan Alistarh
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" specialized datasets.
1 code implementation • 16 Nov 2021 • Ilia Markov, Hamidreza Ramezanikebrya, Dan Alistarh
CGX is based on two technical advances: \emph{At the system level}, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly-efficient support for compressed communication.
no code implementations • 8 Jul 2021 • Alexandra Peste, Dan Alistarh, Christoph H. Lampert
The availability of large amounts of user-provided data has been key to the success of machine learning for many real-world tasks.
2 code implementations • NeurIPS 2021 • Elias Frantar, Eldar Kurtic, Dan Alistarh
We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian.
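When the Hessian has the form $H = \lambda I + \sum_{i=1}^m g_i g_i^\top$, the quantity being computed can be sketched directly via the Woodbury identity; this dense illustration matches the stated cost structure but is not the paper's optimized M-FAC algorithm:

```python
import numpy as np

def ihvp(G, v, lam=1e-4):
    # G: (m, d) matrix whose rows are the rank-one factors g_i; v: (d,) query vector.
    # Woodbury: (lam*I + G^T G)^{-1} v = (v - G^T (lam*I_m + G G^T)^{-1} G v) / lam
    m = G.shape[0]
    K = lam * np.eye(m) + G @ G.T        # O(d m^2) precomputation, reusable across queries
    return (v - G.T @ np.linalg.solve(K, G @ v)) / lam   # O(d m) per query, given K's factorization
```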
2 code implementations • NeurIPS 2021 • Alexandra Peste, Eugenia Iofinova, Adrian Vladu, Dan Alistarh
The increasing computational requirements of deep neural networks (DNNs) have led to significant interest in obtaining DNN models that are sparse, yet accurate.
Ranked #1 on Network Pruning on CIFAR-100
no code implementations • 28 Apr 2021 • Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy
As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training.
no code implementations • 17 Feb 2021 • Dan Alistarh, Rati Gelashvili, Joel Rybicki
Let $G$ be a graph on $n$ nodes.
Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms
no code implementations • 31 Jan 2021 • Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste
The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components.
no code implementations • 1 Jan 2021 • Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh
On the theoretical side, we show that this method guarantees ergodic convergence for non-convex objectives, and achieves the classic sublinear rate under standard assumptions.
no code implementations • ICLR 2021 • Zeyuan Allen-Zhu, Faeze Ebrahimian, Jerry Li, Dan Alistarh
We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions.
no code implementations • NeurIPS 2020 • Vitalii Aksenov, Dan Alistarh, Janne H. Korhonen
The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning.
1 code implementation • NeurIPS 2020 • Fartash Faghri, Iman Tabrizian, Ilia Markov, Dan Alistarh, Daniel Roy, Ali Ramezani-Kebrya
Many communication-efficient variants of SGD use gradient quantization schemes.
no code implementations • NeurIPS 2021 • Dan Alistarh, Janne H. Korhonen
We focus on the communication complexity of this problem: our main result provides the first fully unconditional bounds on the total number of bits that need to be sent and received by the $N$ machines to solve this problem under point-to-point communication, within a given error tolerance.
no code implementations • 28 Sep 2020 • Janne H. Korhonen, Dan Alistarh
Motivated by the interest in communication-efficient methods for distributed machine learning, we consider the communication complexity of minimising a sum of $d$-dimensional functions $\sum_{i = 1}^N f_i (x)$, where each function $f_i$ is held by one of the $N$ different machines.
no code implementations • 12 Jun 2020 • Vyacheslav Kungurtsev, Bapi Chatterjee, Dan Alistarh
Stochastic Gradient Langevin Dynamics (SGLD) ensures strong guarantees with regards to convergence in measure for sampling log-concave posterior distributions by adding noise to stochastic gradient iterates.
no code implementations • 30 Apr 2020 • Shigang Li, Tal Ben-Nun, Giorgi Nadiradze, Salvatore Di Girolamo, Nikoli Dryden, Dan Alistarh, Torsten Hoefler
For evaluation, we train ResNet-50 on ImageNet, a Transformer for machine translation, and a deep reinforcement learning agent for navigation at scale.
1 code implementation • NeurIPS 2020 • Sidak Pal Singh, Dan Alistarh
Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems.
1 code implementation • 20 Mar 2020 • Dan Alistarh, Nikita Koval, Giorgi Nadiradze
We show that, for algorithms such as Delaunay mesh triangulation and sorting by insertion, schedulers with a maximum relaxation factor of $k$ in terms of the maximum priority inversion allowed will introduce a maximum amount of wasted work of $O(\log(n) \, \mathrm{poly}(k))$, where $n$ is the number of tasks to be executed.
Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing
no code implementations • 25 Feb 2020 • Vitaly Aksenov, Dan Alistarh, Janne H. Korhonen
The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning.
no code implementations • ICML 2020 • Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph H. Lampert
We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms.
no code implementations • ICLR 2021 • Peter Davies, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, Dan Alistarh
We provide a method of quantization which allows distributed mean estimation to be performed with solution quality dependent only on the distance between inputs, not on input norm, and show an analogous result for distributed variance reduction.
no code implementations • 16 Jan 2020 • Giorgi Nadiradze, Ilia Markov, Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh
Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods used in practice to train large-scale machine learning models.
no code implementations • NeurIPS 2021 • Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Shigang Li, Dan Alistarh
Perhaps surprisingly, we show that a variant of SGD called \emph{SwarmSGD} still converges in this setting, even if \emph{non-blocking communication}, \emph{quantization}, and \emph{local steps} are all applied \emph{in conjunction}, and even if the node data distributions and underlying graph topology are both \emph{heterogenous}.
no code implementations • 25 Sep 2019 • Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy
As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed on clusters to perform model fitting in parallel.
no code implementations • 25 Sep 2019 • Vyacheslav Kungurtsev, Malcolm Egan, Bapi Chatterjee, Dan Alistarh
This is all the more surprising since these objectives are the ones appearing in the training of deep neural networks.
1 code implementation • NeurIPS 2019 • Chris Wendler, Dan Alistarh, Markus Püschel
We present a novel class of convolutional neural networks (CNNs) for set functions, i.e., data indexed with the powerset of a finite set.
no code implementations • 12 Aug 2019 • Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler
Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself.
no code implementations • 29 Mar 2019 • Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, Furong Huang, Martin Jaggi, Kevin Jamieson, Michael. I. Jordan, Gauri Joshi, Rania Khalaf, Jason Knight, Jakub Konečný, Tim Kraska, Arun Kumar, Anastasios Kyrillidis, Aparna Lakshmiratan, Jing Li, Samuel Madden, H. Brendan McMahan, Erik Meijer, Ioannis Mitliagkas, Rajat Monga, Derek Murray, Kunle Olukotun, Dimitris Papailiopoulos, Gennady Pekhimenko, Theodoros Rekatsinas, Afshin Rostamizadeh, Christopher Ré, Christopher De Sa, Hanie Sedghi, Siddhartha Sen, Virginia Smith, Alex Smola, Dawn Song, Evan Sparks, Ion Stoica, Vivienne Sze, Madeleine Udell, Joaquin Vanschoren, Shivaram Venkataraman, Rashmi Vinayak, Markus Weimer, Andrew Gordon Wilson, Eric Xing, Matei Zaharia, Ce Zhang, Ameet Talwalkar
Machine learning (ML) techniques are enjoying rapidly increasing adoption.
no code implementations • 17 Oct 2018 • Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu
Most of today's distributed machine learning systems assume {\em reliable networks}: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message.
no code implementations • NeurIPS 2018 • Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, Cédric Renggli
Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace.
no code implementations • 23 Mar 2018 • Dan Alistarh, Christopher De Sa, Nikola Konstantinov
Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks.
no code implementations • NeurIPS 2018 • Dan Alistarh, Zeyuan Allen-Zhu, Jerry Li
This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially.
no code implementations • 22 Feb 2018 • Cedric Renggli, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, Torsten Hoefler
This allreduce is the single communication and thus scalability bottleneck for most machine learning workloads.
5 code implementations • ICLR 2018 • Antonio Polino, Razvan Pascanu, Dan Alistarh
Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning.
no code implementations • 14 Feb 2018 • Nezihe Merve Gürel, Kaan Kara, Alen Stojanov, Tyler Smith, Thomas Lemmin, Dan Alistarh, Markus Püschel, Ce Zhang
Modern scientific instruments produce vast amounts of data, which can overwhelm the processing ability of computer systems.
1 code implementation • 13 Feb 2018 • David Dao, Dan Alistarh, Claudiu Musat, Ce Zhang
We illustrate that trusted computation can enable the creation of an AI market, where each data point has an exact value that should be paid to its creator.
no code implementations • ICML 2017 • Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, Ce Zhang
We examine training at reduced precision, both from a theoretical and practical perspective, and ask: is it possible to train models at end-to-end low precision with provable guarantees?
1 code implementation • 13 Jun 2017 • Dan Alistarh, Justin Kopinsky, Jerry Li, Giorgi Nadiradze
We answer this question, showing that this strategy provides surprisingly strong guarantees: although the single-choice process, where we always insert into and remove from a single randomly chosen queue, has cost that degrades to infinity as the number of steps increases, in the two-choice process the expected rank of a removed element is $O(n)$, while the expected worst-case cost is $O(n \log n)$.
Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing
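A quick simulation of one natural reading of the two-choice removal process (the parameters and rank measurement below are illustrative assumptions, not taken from the paper):

```python
import heapq
import random

random.seed(0)
n, steps = 64, 50_000
queues = [[] for _ in range(n)]          # n priority queues (min-heaps)
ranks = []

for label in range(steps):
    heapq.heappush(queues[random.randrange(n)], label)   # insert into a random queue
    # two-choice removal: sample two queues, pop the smaller (higher-priority) head
    a, b = random.randrange(n), random.randrange(n)
    candidates = [q for q in (queues[a], queues[b]) if q]
    if not candidates:
        continue
    q = min(candidates, key=lambda h: h[0])
    removed = heapq.heappop(q)
    # rank = number of currently-present elements smaller than the one removed
    ranks.append(sum(1 for h in queues for x in h if x < removed))

print(sum(ranks) / len(ranks))           # average rank stays modest, in line with O(n)
```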
1 code implementation • 16 Nov 2016 • Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, Ce Zhang
When applied to linear models together with double sampling, we save up to another 1.7x in data movement compared with uniform quantization.
2 code implementations • NeurIPS 2017 • Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, Milan Vojnovic
In this paper, we propose Quantized SGD (QSGD), a family of compression schemes which allow the compression of gradient updates at each node, while guaranteeing convergence under standard assumptions.
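At its core, QSGD stochastically rounds each coordinate onto one of $s$ levels of the gradient's norm, which keeps the quantizer unbiased; a minimal sketch:

```python
import numpy as np

def qsgd_quantize(v, s=4, rng=None):
    # Unbiased stochastic quantization onto s levels per coordinate (the core of QSGD).
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0:
        return v.copy()
    level = np.abs(v) / norm * s                 # real-valued level in [0, s]
    lower = np.floor(level)
    prob = level - lower                         # round up with this probability
    xi = lower + (rng.random(v.shape) < prob)
    return norm * np.sign(v) * xi / s            # unbiased: E[output] = v
```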
no code implementations • NeurIPS 2015 • Dan Alistarh, Jennifer Iglesias, Milan Vojnovic
In many applications, the data is of rich structure that can be represented by a hypergraph, where the data items are represented by vertices and the associations among items are represented by hyperedges.