Search Results for author: Dan Alistarh

Found 77 papers, 34 papers with code

Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks

no code implementations ICML 2020 Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, Dan Alistarh

In this paper, we present an in-depth analysis of methods for maximizing the sparsity of the activations in a trained neural network, and show that, when coupled with an efficient sparse-input convolution algorithm, we can leverage this sparsity for significant performance gains.

Image Classification
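As a rough illustration of the idea above (not the paper's method or its sparse convolution kernels), the sketch below thresholds post-ReLU activations to increase the fraction of exact zeros and shows how a sparse-aware dot product can skip them; the threshold and shapes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-activation outputs of some layer (hypothetical shape).
pre_act = rng.normal(size=(64, 512))

relu = np.maximum(pre_act, 0.0)                       # plain ReLU: ~50% zeros
thresholded = np.where(pre_act > 0.5, pre_act, 0.0)   # shifted threshold: more zeros

for name, act in [("relu", relu), ("thresholded", thresholded)]:
    print(f"{name}: {np.mean(act == 0.0):.0%} of activations are exactly zero")

# A sparse-aware dot product only touches nonzero activations.
w = rng.normal(size=512)
x = thresholded[0]
nz = np.flatnonzero(x)
assert np.allclose(x[nz] @ w[nz], x @ w)  # same result, fewer multiplications
```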

Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization

1 code implementation 4 Apr 2024 Aniruddha Nrusimha, Mayank Mishra, Naigang Wang, Dan Alistarh, Rameswar Panda, Yoon Kim

We show that regularizing both the inputs and outputs is crucial for preventing the model from "migrating" the difficulty of input quantization to the weights, which makes post-training quantization (PTQ) of the weights more difficult.

Language Modelling Quantization

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

1 code implementation 30 Mar 2024 Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman

We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits.

Quantization
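The algebraic identity underlying rotation-based quantization can be sketched independently of the paper: for an orthogonal $Q$, $(xQ)(Q^\top W) = xW$, so rotating activations and weights leaves outputs unchanged while tending to spread outlier mass across channels. The random QR-based rotation below is only illustrative; it is not QuaRot's Hadamard-based transform.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 128, 64

# Activation row with a strong outlier channel, and a weight matrix.
x = rng.normal(size=d)
x[7] = 50.0                      # outlier channel
W = rng.normal(size=(d, k))

# Random orthogonal matrix via QR (illustrative only).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

x_rot, W_rot = x @ Q, Q.T @ W

print("max |x| before/after rotation:", np.abs(x).max(), np.abs(x_rot).max())
assert np.allclose(x @ W, x_rot @ W_rot)   # outputs are identical
```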

Extreme Compression of Large Language Models via Additive Quantization

1 code implementation 11 Jan 2024 Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh

The emergence of accurate open large language models (LLMs) has led to a race towards quantization techniques for such models enabling execution on end-user devices.

Quantization

RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation

1 code implementation 9 Jan 2024 Mahdi Nikdan, Soroush Tabesh, Elvir Crnčević, Dan Alistarh

We investigate parameter-efficient fine-tuning (PEFT) methods that can provide good accuracy under limited computational and memory budgets in the context of large language models (LLMs).

Math Quantization
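For background only: a common PEFT baseline in this line of work is a LoRA-style low-rank update, sketched below with arbitrary shapes and rank. This is not RoSA's robust adaptation scheme, just the generic low-rank-adapter idea shown as context.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 1024, 1024, 8     # r << d: adapter adds only ~2*d*r trainable params

W = rng.normal(size=(d_in, d_out))         # frozen pretrained weight
A = rng.normal(size=(d_in, r)) * 0.01      # trainable
B = np.zeros((r, d_out))                   # trainable, zero-init so W_eff == W at start

def forward(x):
    # Effective weight W + A @ B, applied without ever materializing it.
    return x @ W + (x @ A) @ B

x = rng.normal(size=(4, d_in))
assert np.allclose(forward(x), x @ W)      # identical to the base model at initialization
```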

How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark

no code implementations 21 Dec 2023 Eldar Kurtic, Torsten Hoefler, Dan Alistarh

Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task.

Knowledge Distillation Language Modelling

ELSA: Partial Weight Freezing for Overhead-Free Sparse Network Deployment

no code implementations 11 Dec 2023 Paniz Halvachi, Alexandra Peste, Dan Alistarh, Christoph H. Lampert

We present ELSA, a practical solution for creating deep networks that can easily be deployed at different levels of sparsity.

AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms

no code implementations 31 Oct 2023 Rustem Islamov, Mher Safaryan, Dan Alistarh

As a by-product of our analysis, we also demonstrate convergence guarantees for gradient-type algorithms such as SGD with random reshuffling and shuffle-once mini-batch SGD.
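For reference, the two sampling schemes mentioned above differ only in when the data permutation is drawn; a toy least-squares sketch (step size and sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lr, epochs = 32, 10, 0.01, 50

A = rng.normal(size=(n, d))
b = rng.normal(size=n)
grad_i = lambda x, i: 2 * A[i] * (A[i] @ x - b[i])   # gradient of (a_i^T x - b_i)^2

def run(shuffle_every_epoch):
    x = np.zeros(d)
    perm = rng.permutation(n)                 # shuffle-once: one permutation for all epochs
    for _ in range(epochs):
        if shuffle_every_epoch:               # random reshuffling: new permutation per epoch
            perm = rng.permutation(n)
        for i in perm:
            x = x - lr * grad_i(x, i)
    return np.mean((A @ x - b) ** 2)

print("random reshuffling:", run(True))
print("shuffle-once      :", run(False))
```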

QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

1 code implementation 25 Oct 2023 Elias Frantar, Dan Alistarh

Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts.
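As context for the sparse-routing claim, a generic top-2 MoE layer can be sketched as follows (illustrative shapes and expert count; this is not QMoE's compression scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d, k = 8, 16, 2

experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]   # expert weights
router = rng.normal(size=(d, n_experts))                        # gating weights

def moe_layer(x):
    logits = x @ router                                # one score per expert
    top = np.argsort(logits)[-k:]                      # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                               # softmax over the selected experts
    # Only k of the n_experts weight matrices are touched for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_layer(rng.normal(size=d))
print(y.shape)   # all parameters exist, but only 2 of 8 experts ran
```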

QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

1 code implementation 13 Oct 2023 Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy.

Computational Efficiency Quantization

Sparse Fine-tuning for Inference Acceleration of Large Language Models

2 code implementations 10 Oct 2023 Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh

While the standard approach is to leverage sparsity for computational reduction, we observe that in the case of memory-bound LLMs sparsity can also be leveraged for reducing memory bandwidth.

Quantization Text Generation +1

SPADE: Sparsity-Guided Debugging for Deep Neural Networks

no code implementations 6 Oct 2023 Arshia Soltani Moakhar, Eugenia Iofinova, Dan Alistarh

Towards this goal, multiple tools have been proposed to aid a human examiner in reasoning about a network's behavior in general or on a set of instances.

Learning Theory

Scaling Laws for Sparsely-Connected Foundation Models

no code implementations 15 Sep 2023 Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci

We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains.

Computational Efficiency

Accurate Neural Network Pruning Requires Rethinking Sparse Optimization

no code implementations 3 Aug 2023 Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, Dan Alistarh

Obtaining versions of deep neural networks that are both highly-accurate and highly-sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community.

Model Compression Network Pruning +1

QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

1 code implementation 7 Jul 2023 Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel

We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs.

Code Generation
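A rough NumPy sketch of the kind of computation such generated kernels perform: group-wise 4-bit weights are dequantized with a per-group scale and reduced against the input. The group size and bit-width below are illustrative, not the paper's generated CPU code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, group = 256, 64

w = rng.normal(size=d)                       # original weights
x = rng.normal(size=d)                       # input activations

# Quantize each group of 64 weights to 4-bit integers with one scale per group.
w_groups = w.reshape(-1, group)
scales = np.abs(w_groups).max(axis=1, keepdims=True) / 7.0
q = np.clip(np.round(w_groups / scales), -8, 7).astype(np.int8)

# "Kernel": dequantize on the fly and accumulate the dot product.
y = ((q * scales).reshape(-1) * x).sum()
print(abs(y - w @ x))    # quantization error of the 4-bit kernel vs. the dense dot product
```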

The Power of Populations in Decentralized Bandits

no code implementations 14 Jun 2023 John Lazarsfeld, Dan Alistarh

We study a cooperative multi-agent bandit setting in the distributed GOSSIP model: in every round, each of $n$ agents chooses an action from a common set, observes the action's corresponding reward, and subsequently exchanges information with a single randomly chosen neighbor, which informs its policy in the next round.

Error Feedback Can Accurately Compress Preconditioners

1 code implementation 9 Jun 2023 Ionut-Vlad Modoranu, Aleksei Kalinov, Eldar Kurtic, Elias Frantar, Dan Alistarh

Experiments on deep neural networks show that this approach can compress full-matrix preconditioners to up to 99% sparsity without accuracy loss, effectively removing the memory overhead of full-matrix preconditioners such as GGT and M-FAC.

Classification Second-order methods
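The error-feedback mechanism itself is simple to state, independently of preconditioners: compress aggressively, but carry the compression residual into the next step. A minimal sketch with top-k as the compressor (the paper applies the idea to full-matrix preconditioners such as GGT and M-FAC):

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_compress(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]   # keep only the k largest-magnitude entries
    out[idx] = v[idx]
    return out

d, k = 1000, 10          # keep 1% of entries
error = np.zeros(d)      # accumulated compression residual

for step in range(100):
    g = rng.normal(size=d)            # stand-in for a gradient / preconditioner update
    corrected = g + error             # add back what previous steps dropped
    compressed = topk_compress(corrected, k)
    error = corrected - compressed    # remember what was dropped this time
    # `compressed` is what would actually be stored or communicated.

print("residual norm carried over:", np.linalg.norm(error))
```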

Knowledge Distillation Performs Partial Variance Reduction

1 code implementation NeurIPS 2023 Mher Safaryan, Alexandra Peste, Dan Alistarh

We show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism.

Knowledge Distillation
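For reference, the standard distillation objective analyzed in this line of work matches temperature-softened teacher and student distributions; a minimal sketch (temperature and shapes arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions, scaled by T^2.
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    return (T * T) * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

rng = np.random.default_rng(0)
student, teacher = rng.normal(size=(4, 10)), rng.normal(size=(4, 10))
print(kd_loss(student, teacher))
```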

Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures

no code implementations CVPR 2023 Eugenia Iofinova, Alexandra Peste, Dan Alistarh

Pruning - that is, setting a significant subset of the parameters of a neural network to zero - is one of the most popular methods of model compression.

Model Compression Network Pruning
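Pruning in the sense used above can be made concrete with global magnitude pruning, its simplest instance (a generic sketch, not the specific pruned models studied in the paper):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]     # k-th smallest magnitude
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_sparse = magnitude_prune(W, 0.9)
print(np.mean(W_sparse == 0.0))   # ~0.9 of the parameters are now exactly zero
```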

Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression

no code implementations 25 Mar 2023 Denis Kuznedelev, Soroush Tabesh, Kimia Noorbakhsh, Elias Frantar, Sara Beery, Eldar Kurtic, Dan Alistarh

To address this, we ask: can we quickly compress large generalist models into accurate and efficient specialists?

SparseProp: Efficient Sparse Backpropagation for Faster Training of Neural Networks

1 code implementation 9 Feb 2023 Mahdi Nikdan, Tommaso Pegolotti, Eugenia Iofinova, Eldar Kurtic, Dan Alistarh

We provide a new efficient version of the backpropagation algorithm, specialized to the case where the weights of the neural network being trained are sparse.

Transfer Learning

ZipLM: Inference-Aware Structured Pruning of Language Models

1 code implementation NeurIPS 2023 Eldar Kurtic, Elias Frantar, Dan Alistarh

Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications.

Quantized Distributed Training of Large Models with Convergence Guarantees

no code implementations 5 Feb 2023 Ilia Markov, Adrian Vladu, Qi Guo, Dan Alistarh

Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs).

Quantization

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

3 code implementations 2 Jan 2023 Elias Frantar, Dan Alistarh

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy.

 Ranked #1 on Language Modelling on WikiText-2 (using extra training data)

Common Sense Reasoning Language Modelling +2

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

11 code implementations 31 Oct 2022 Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient.

Language Modelling Model Compression +1
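For contrast with GPTQ's second-order approach, the simplest post-training baseline is per-channel round-to-nearest weight quantization; the sketch below shows only that baseline, not GPTQ itself.

```python
import numpy as np

def quantize_rtn(W, bits=4):
    """Symmetric per-output-channel round-to-nearest quantization (baseline, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 128))
q, scale = quantize_rtn(W)
W_hat = q * scale                                    # dequantized weights
print("mean abs error:", np.abs(W - W_hat).mean())
```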

L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning

1 code implementation 31 Oct 2022 Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, Dan Alistarh

Data-parallel distributed training of deep neural networks (DNN) has gained very widespread adoption, but can still experience communication bottlenecks.

Image Classification Language Modelling +1

Hybrid Decentralized Optimization: First- and Zeroth-Order Optimizers Can Be Jointly Leveraged For Faster Convergence

no code implementations 14 Oct 2022 Shayan Talaei, Giorgi Nadiradze, Dan Alistarh

Distributed optimization has become one of the standard ways of speeding up machine learning training, and most of the research in the area focuses on distributed first-order, gradient-based methods.

Distributed Optimization

CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models

no code implementations NeurIPS 2023 Denis Kuznedelev, Eldar Kurtic, Elias Frantar, Dan Alistarh

To further showcase CAP's accuracy and scalability, we use it to show for the first time that extremely-accurate large vision models, trained via self-supervised techniques, can also be pruned to moderate sparsities, with negligible accuracy loss.

Image Classification Quantization

GMP*: Well-Tuned Gradual Magnitude Pruning Can Outperform Most BERT-Pruning Methods

no code implementations 12 Oct 2022 Eldar Kurtic, Dan Alistarh

We revisit the performance of the classic gradual magnitude pruning (GMP) baseline for large language models, focusing on the classic BERT benchmark on various popular tasks.
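Classic GMP is usually run with the cubic sparsity schedule of Zhu & Gupta (2017), pruning a little more at regular intervals until the target sparsity is reached; a minimal sketch of that schedule (hyperparameters arbitrary):

```python
def gmp_sparsity(step, start_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Cubic sparsity ramp used by gradual magnitude pruning (Zhu & Gupta, 2017)."""
    if step < start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - progress) ** 3

# Prune from 0% to 90% sparsity between steps 1000 and 10000.
for step in [0, 1000, 4000, 7000, 10000]:
    print(step, round(gmp_sparsity(step, 1000, 10000, 0.9), 3))
```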

Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning

1 code implementation 24 Aug 2022 Elias Frantar, Sidak Pal Singh, Dan Alistarh

We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data.

Model Compression Quantization

CrAM: A Compression-Aware Minimizer

1 code implementation 28 Jul 2022 Alexandra Peste, Adrian Vladu, Eldar Kurtic, Christoph H. Lampert, Dan Alistarh

In this work we propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way, in order to produce models whose local loss behavior is stable under compression operations such as pruning.

Image Classification Language Modelling +2

Communication-Efficient Federated Learning With Data and Client Heterogeneity

no code implementations 20 Jun 2022 Hossein Zakerinia, Shayan Talaei, Giorgi Nadiradze, Dan Alistarh

Federated Learning (FL) enables large-scale distributed training of machine learning models, while still allowing individual nodes to maintain data locally.

Federated Learning

Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD

1 code implementation 13 Mar 2022 Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh

Our scheme is based on the following algorithmic tools and features: (a) asynchronous local gradient updates on the shared-memory of workers, (b) partial backpropagation, and (c) non-blocking in-place averaging of the local models.

Blocking Image Classification

SPDY: Accurate Pruning with Speedup Guarantees

1 code implementation 31 Jan 2022 Elias Frantar, Dan Alistarh

The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular.

Model Compression

How Well Do Sparse Imagenet Models Transfer?

1 code implementation CVPR 2022 Eugenia Iofinova, Alexandra Peste, Mark Kurtz, Dan Alistarh

Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" specialized datasets.

Transfer Learning

CGX: Adaptive System Support for Communication-Efficient Deep Learning

1 code implementation 16 Nov 2021 Ilia Markov, Hamidreza Ramezanikebrya, Dan Alistarh

CGX is based on two technical advances: At the system level, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly-efficient support for compressed communication.

SSSE: Efficiently Erasing Samples from Trained Machine Learning Models

no code implementations 8 Jul 2021 Alexandra Peste, Dan Alistarh, Christoph H. Lampert

The availability of large amounts of user-provided data has been key to the success of machine learning for many real-world tasks.

BIG-bench Machine Learning

M-FAC: Efficient Matrix-Free Approximations of Second-Order Information

2 code implementations NeurIPS 2021 Elias Frantar, Eldar Kurtic, Dan Alistarh

We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian.

Network Pruning Second-order methods
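As background only (this is not the M-FAC algorithm), one standard matrix-free route to such IHVPs when the Hessian is approximated as $\lambda I + \frac{1}{m}\sum_i g_i g_i^\top$ is the Woodbury identity, which never materializes the $d \times d$ matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 32, 1000, 0.1          # m rank-one terms, dimension d, damping

G = rng.normal(size=(m, d))        # rows g_i: the rank-one factors
v = rng.normal(size=d)

# IHVP for F = lam*I + (1/m) G^T G via Woodbury:
# F^{-1} v = v/lam - (1/lam^2) G^T (m I + G G^T / lam)^{-1} G v
small = m * np.eye(m) + (G @ G.T) / lam            # only an m x m system
ihvp = v / lam - (G.T @ np.linalg.solve(small, G @ v)) / lam**2

# Check against the dense computation (feasible here only because d is small).
F = lam * np.eye(d) + (G.T @ G) / m
assert np.allclose(F @ ihvp, v)
```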

AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks

2 code implementations NeurIPS 2021 Alexandra Peste, Eugenia Iofinova, Adrian Vladu, Dan Alistarh

The increasing computational requirements of deep neural networks (DNNs) have led to significant interest in obtaining DNN models that are sparse, yet accurate.

Network Pruning

NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

no code implementations 28 Apr 2021 Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy

As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training.

Quantization

Fast Graphical Population Protocols

no code implementations 17 Feb 2021 Dan Alistarh, Rati Gelashvili, Joel Rybicki

Let $G$ be a graph on $n$ nodes.

Distributed, Parallel, and Cluster Computing Data Structures and Algorithms

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

no code implementations 31 Jan 2021 Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste

The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components.

Local SGD Meets Asynchrony

no code implementations 1 Jan 2021 Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh

On the theoretical side, we show that this method guarantees ergodic convergence for non-convex objectives, and achieves the classic sublinear rate under standard assumptions.

Blocking

Byzantine-Resilient Non-Convex Stochastic Gradient Descent

no code implementations ICLR 2021 Zeyuan Allen-Zhu, Faeze Ebrahimian, Jerry Li, Dan Alistarh

We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions.

Distributed Optimization

Scalable Belief Propagation via Relaxed Scheduling

no code implementations NeurIPS 2020 Vitalii Aksenov, Dan Alistarh, Janne H. Korhonen

The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning.

BIG-bench Machine Learning Scheduling

Towards Tight Communication Lower Bounds for Distributed Optimisation

no code implementations NeurIPS 2021 Dan Alistarh, Janne H. Korhonen

We focus on the communication complexity of this problem: our main result provides the first fully unconditional bounds on total number of bits which need to be sent and received by the $N$ machines to solve this problem under point-to-point communication, within a given error-tolerance.

Improved Communication Lower Bounds for Distributed Optimisation

no code implementations 28 Sep 2020 Janne H. Korhonen, Dan Alistarh

Motivated by the interest in communication-efficient methods for distributed machine learning, we consider the communication complexity of minimising a sum of $d$-dimensional functions $\sum_{i = 1}^N f_i (x)$, where each function $f_i$ is held by one of the $N$ different machines.

Stochastic Gradient Langevin with Delayed Gradients

no code implementations 12 Jun 2020 Vyacheslav Kungurtsev, Bapi Chatterjee, Dan Alistarh

Stochastic Gradient Langevin Dynamics (SGLD) ensures strong guarantees with regards to convergence in measure for sampling log-concave posterior distributions by adding noise to stochastic gradient iterates.

Stochastic Optimization
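The SGLD update referenced above is a one-liner: a (stochastic) gradient step on the log-posterior plus Gaussian noise whose variance matches the step size. A toy sketch on a standard-Gaussian target (not the paper's delayed-gradient variant):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_p(x):
    # Toy log-concave posterior: standard Gaussian, so grad log p(x) = -x.
    return -x

x, eta, steps = np.zeros(2), 0.01, 20000
samples = []
for _ in range(steps):
    noise = rng.normal(size=x.shape)
    # SGLD step (Welling & Teh form): x <- x + (eta/2) grad log p(x) + N(0, eta I)
    x = x + 0.5 * eta * grad_log_p(x) + np.sqrt(eta) * noise
    samples.append(x.copy())

# After burn-in, the empirical variance approaches that of the target (~1 per coordinate).
print(np.var(np.array(samples)[steps // 2:], axis=0))
```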

WoodFisher: Efficient Second-Order Approximation for Neural Network Compression

1 code implementation NeurIPS 2020 Sidak Pal Singh, Dan Alistarh

Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems.

Image Classification Neural Network Compression

Efficiency Guarantees for Parallel Incremental Algorithms under Relaxed Schedulers

1 code implementation 20 Mar 2020 Dan Alistarh, Nikita Koval, Giorgi Nadiradze

We show that, for algorithms such as Delaunay mesh triangulation and sorting by insertion, schedulers with a maximum relaxation factor of $k$ in terms of the maximum priority inversion allowed will introduce a maximum amount of wasted work of $O(\log(n) \cdot \mathrm{poly}(k))$, where $n$ is the number of tasks to be executed.

Data Structures and Algorithms Distributed, Parallel, and Cluster Computing

Relaxed Scheduling for Scalable Belief Propagation

no code implementations 25 Feb 2020 Vitaly Aksenov, Dan Alistarh, Janne H. Korhonen

The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning.

BIG-bench Machine Learning Scheduling

On the Sample Complexity of Adversarial Multi-Source PAC Learning

no code implementations ICML 2020 Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph H. Lampert

We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms.

PAC learning

New Bounds For Distributed Mean Estimation and Variance Reduction

no code implementations ICLR 2021 Peter Davies, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, Dan Alistarh

We provide a method of quantization which allows distributed mean estimation to be performed with solution quality dependent only on the distance between inputs, not on input norm, and show an analogous result for distributed variance reduction.

Distributed Optimization Quantization

Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent

no code implementations 16 Jan 2020 Giorgi Nadiradze, Ilia Markov, Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh

Our framework, called elastic consistency enables us to derive convergence bounds for a variety of distributed SGD methods used in practice to train large-scale machine learning models.

BIG-bench Machine Learning

Asynchronous Decentralized SGD with Quantized and Local Updates

no code implementations NeurIPS 2021 Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Shigang Li, Dan Alistarh

Perhaps surprisingly, we show that a variant of SGD called SwarmSGD still converges in this setting, even if non-blocking communication, quantization, and local steps are all applied in conjunction, and even if the node data distributions and underlying graph topology are both heterogeneous.

Blocking Distributed Optimization +2

Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

no code implementations 25 Sep 2019 Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy

As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed on clusters to perform model fitting in parallel.

Quantization

Asynchronous Stochastic Subgradient Methods for General Nonsmooth Nonconvex Optimization

no code implementations 25 Sep 2019 Vyacheslav Kungurtsev, Malcolm Egan, Bapi Chatterjee, Dan Alistarh

This is all the more surprising since these objectives are the ones appearing in the training of deep neural networks.

Scheduling

Powerset Convolutional Neural Networks

1 code implementation NeurIPS 2019 Chris Wendler, Dan Alistarh, Markus Püschel

We present a novel class of convolutional neural networks (CNNs) for set functions, i.e., data indexed with the powerset of a finite set.

Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

no code implementations 12 Aug 2019 Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler

Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself.

Distributed Learning over Unreliable Networks

no code implementations 17 Oct 2018 Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu

Most of today's distributed machine learning systems assume reliable networks: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message.

BIG-bench Machine Learning

The Convergence of Sparsified Gradient Methods

no code implementations NeurIPS 2018 Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, Cédric Renggli

Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace.

Quantization

The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory

no code implementations 23 Mar 2018 Dan Alistarh, Christopher De Sa, Nikola Konstantinov

Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks.

BIG-bench Machine Learning

Byzantine Stochastic Gradient Descent

no code implementations NeurIPS 2018 Dan Alistarh, Zeyuan Allen-Zhu, Jerry Li

This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially.

Stochastic Optimization

Model compression via distillation and quantization

5 code implementations ICLR 2018 Antonio Polino, Razvan Pascanu, Dan Alistarh

Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning.

Model Compression Quantization

DataBright: Towards a Global Exchange for Decentralized Data Ownership and Trusted Computation

1 code implementation 13 Feb 2018 David Dao, Dan Alistarh, Claudiu Musat, Ce Zhang

We illustrate that trusted computation can enable the creation of an AI market, where each data point has an exact value that should be paid to its creator.

BIG-bench Machine Learning

ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning

no code implementations ICML 2017 Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, Ce Zhang

We examine training at reduced precision, both from a theoretical and practical perspective, and ask: is it possible to train models at end-to-end low precision with provable guarantees?

Quantization

The Power of Choice in Priority Scheduling

1 code implementation 13 Jun 2017 Dan Alistarh, Justin Kopinsky, Jerry Li, Giorgi Nadiradze

We answer this question, showing that this strategy provides surprisingly strong guarantees: Although the single-choice process, where we always insert and remove from a single randomly chosen queue, has degrading cost, going to infinity as we increase the number of steps, in the two-choice process, the expected rank of a removed element is $O(n)$ while the expected worst-case cost is $O(n \log n)$.

Data Structures and Algorithms Distributed, Parallel, and Cluster Computing
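The two-choice process is easy to simulate: keep $n$ queues, insert into one chosen uniformly at random, and delete by popping the smaller of the minima of two randomly chosen queues. The sketch below measures the rank of removed elements empirically (a simulation, not the paper's analysis).

```python
import heapq
import random

random.seed(0)
n, ops = 16, 20000
queues = [[] for _ in range(n)]            # n priority queues (min-heaps)
counter, ranks = 0, []

for _ in range(ops):
    # Insert into one uniformly random queue.
    heapq.heappush(queues[random.randrange(n)], counter)
    counter += 1

    # Delete: look at two random queues, pop the smaller of their minima.
    a, b = random.randrange(n), random.randrange(n)
    candidates = [q for q in (queues[a], queues[b]) if q]
    if not candidates:
        continue
    removed = heapq.heappop(min(candidates, key=lambda q: q[0]))
    # Rank = number of smaller elements still present across all queues.
    ranks.append(sum(1 for q in queues for item in q if item < removed))

print("average rank of removed element:", sum(ranks) / len(ranks))
```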

The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning

1 code implementation 16 Nov 2016 Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, Ce Zhang

When applied to linear models together with double sampling, we save up to another 1.7x in data movement compared with uniform quantization.

Quantization

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

2 code implementations NeurIPS 2017 Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, Milan Vojnovic

In this paper, we propose Quantized SGD (QSGD), a family of compression schemes which allow the compression of gradient updates at each node, while guaranteeing convergence under standard assumptions.

Image Classification Quantization +2
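The quantizer at the core of QSGD-style schemes can be sketched as stochastic rounding of normalized coordinates onto $s$ levels, which keeps the compressed gradient unbiased; the version below omits the paper's Elias coding of the result.

```python
import numpy as np

rng = np.random.default_rng(0)

def qsgd_quantize(v, s=16):
    """Stochastic s-level quantization of a vector; unbiased by construction."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * s          # each coordinate mapped into [0, s]
    lower = np.floor(scaled)
    prob_up = scaled - lower               # round up with this probability
    levels = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * levels / s

g = rng.normal(size=1000)
# Unbiasedness: averaging many independent quantizations recovers the gradient.
avg = np.mean([qsgd_quantize(g) for _ in range(5000)], axis=0)
print(np.abs(avg - g).max())               # shrinks toward 0 as more quantizations are averaged
```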

Streaming Min-max Hypergraph Partitioning

no code implementations NeurIPS 2015 Dan Alistarh, Jennifer Iglesias, Milan Vojnovic

In many applications, the data is of rich structure that can be represented by a hypergraph, where the data items are represented by vertices and the associations among items are represented by hyperedges.

Clustering hypergraph partitioning
