no code implementations • 31 Oct 2023 • Rustem Islamov, Mher Safaryan, Dan Alistarh
As a by-product of our analysis, we also demonstrate convergence guarantees for gradient-type algorithms such as SGD with random reshuffling and shuffle-once mini-batch SGD.
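As an illustration of the sampling schemes involved, below is a minimal single-machine sketch of SGD with random reshuffling on a least-squares problem; the problem, step size, and function names are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def sgd_random_reshuffling(A, b, lr=0.01, epochs=100, seed=0):
    # Minimal sketch (assumed least-squares setup): SGD where each epoch
    # processes the data in a fresh random permutation. "Shuffle-once"
    # would draw one permutation before the first epoch and reuse it.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):        # fresh shuffle each epoch
            residual = A[i] @ x - b[i]
            x -= lr * residual * A[i]       # gradient of (1/2)(a_i.x - b_i)^2
    return x

# usage: recover x_true from consistent linear measurements
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
print(np.linalg.norm(sgd_random_reshuffling(A, A @ x_true) - x_true))
```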
1 code implementation • NeurIPS 2023 • Mher Safaryan, Alexandra Peste, Dan Alistarh
We show that, in the context of linear and deep linear models, knowledge distillation (KD) can be interpreted as a novel type of stochastic variance reduction mechanism.
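For intuition, here is a minimal sketch of a distillation step for a linear student under squared loss; the mixing weight `lam` and the overall setup are illustrative assumptions, and the variance-reduction interpretation itself comes from the paper's analysis, not from this snippet.

```python
import numpy as np

def kd_sgd_step(x, a_i, y_i, x_teacher, lam, lr):
    # Minimal sketch (assumed setup): a linear student trained with SGD on
    # a target mixing the hard label y_i with a linear teacher's prediction.
    # lam = 1 recovers plain SGD on the labels; lam < 1 replaces part of the
    # noisy label with the teacher's (less noisy) output.
    target = lam * y_i + (1.0 - lam) * (a_i @ x_teacher)
    grad = (a_i @ x - target) * a_i         # squared-loss gradient
    return x - lr * grad
```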
1 code implementation • 28 Oct 2022 • Artavazd Maranjyan, Mher Safaryan, Peter Richtárik
We study a class of distributed optimization algorithms that aim to alleviate high communication costs by allowing the clients to perform multiple local gradient-type training steps prior to communication.
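A minimal sketch of this local-steps template, under an assumed least-squares setup, looks as follows; only the averaging line incurs communication.

```python
import numpy as np

def local_gd(client_data, d, H=10, rounds=50, lr=0.05):
    # Minimal sketch (assumed setup): each client takes H local gradient
    # steps on its own loss f_m(x) = (1/2 n_m) ||A_m x - b_m||^2, then the
    # server averages the local iterates. One vector per client is sent
    # per round, instead of one per gradient step.
    x = np.zeros(d)
    for _ in range(rounds):
        local_iterates = []
        for A_m, b_m in client_data:
            y = x.copy()
            for _ in range(H):               # local training, no communication
                y -= lr * A_m.T @ (A_m @ y - b_m) / len(b_m)
            local_iterates.append(y)
        x = np.mean(local_iterates, axis=0)  # one communication round
    return x
```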
no code implementations • 7 Jun 2022 • Rustem Islamov, Xun Qian, Slavomír Hanzely, Mher Safaryan, Peter Richtárik
Despite their high computation and communication costs, Newton-type methods remain an appealing option for distributed training due to their robustness against ill-conditioned convex problems.
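To make the conditioning argument concrete, here is a minimal single-machine Newton iteration on l2-regularized logistic regression (an assumed example problem, not the paper's distributed method); the inverse-Hessian rescaling is what keeps the step well-behaved on ill-conditioned problems.

```python
import numpy as np

def newton_logreg(A, y, steps=10, mu=1e-3):
    # Minimal sketch: Newton's method on l2-regularized logistic regression
    # f(x) = (1/n) sum_i log(1 + exp(-y_i * a_i @ x)) + (mu/2) ||x||^2,
    # with labels y_i in {-1, +1}.
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(y * (A @ x)))     # sigmoid(-y_i * a_i.x)
        grad = -(A.T @ (y * s)) / n + mu * x
        w = s * (1.0 - s)                         # per-sample curvature
        H = (A.T * w) @ A / n + mu * np.eye(d)
        x -= np.linalg.solve(H, grad)             # rescale by inverse Hessian
    return x
```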
no code implementations • 2 Nov 2021 • Xun Qian, Rustem Islamov, Mher Safaryan, Peter Richtárik
Recent advances in distributed optimization have shown that Newton-type methods with proper communication compression mechanisms can guarantee fast local rates and low communication cost compared to first-order methods.
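One concrete low-rank compressor in this family is a Rank-R projection via truncated SVD; the sketch below is a generic illustration under that assumption, not the specific mechanism analyzed in the paper.

```python
import numpy as np

def rank_r(M, r):
    # Minimal sketch: a Rank-R compressor keeping the top-r terms of the
    # SVD of a (Hessian-like) d x d matrix M. Communicating the factors
    # costs O(d * r) numbers instead of O(d^2) for the full matrix.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]
```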
no code implementations • 7 Jun 2021 • Bokun Wang, Mher Safaryan, Peter Richtárik
To address the high communication costs of distributed machine learning, a large body of work has been devoted in recent years to designing various compression strategies, such as sparsification and quantization, and optimization algorithms capable of using them.
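Two standard examples of such compression operators, sketched generically in NumPy (illustrative, not tied to this paper's analysis): a biased Top-K sparsifier and an unbiased random-K sparsifier.

```python
import numpy as np

def top_k(v, k):
    # Keep the k largest-magnitude coordinates, zero out the rest
    # (a standard biased, contractive sparsifier).
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def rand_k(v, k, rng=None):
    # Keep k uniformly random coordinates, rescaled by len(v)/k so the
    # compressor is unbiased: E[rand_k(v)] = v.
    if rng is None:
        rng = np.random.default_rng()
    out = np.zeros_like(v)
    idx = rng.choice(len(v), size=k, replace=False)
    out[idx] = v[idx] * (len(v) / k)
    return out
```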
no code implementations • 5 Jun 2021 • Mher Safaryan, Rustem Islamov, Xun Qian, Peter Richtárik
In contrast to the aforementioned work, FedNL employs a different Hessian learning technique which i) enhances privacy, as it does not require the training data to be revealed to the coordinating server, ii) makes it applicable beyond generalized linear models, and iii) provably works with general contractive compression operators for compressing the local Hessians, such as Top-$K$ or Rank-$R$, which are vastly superior in practice.
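A heavily simplified, single-client sketch of a compressed Hessian-learning update of this flavor (the function names and the step size `alpha` are assumptions; see the paper for the actual method): only the compressed correction is ever communicated.

```python
import numpy as np

def top_k_matrix(M, k):
    # Entrywise Top-K on a matrix: keep the k largest-magnitude entries
    # (a contractive compressor applied to a d x d Hessian).
    out = np.zeros_like(M)
    flat_idx = np.argsort(np.abs(M), axis=None)[-k:]
    out.ravel()[flat_idx] = M.ravel()[flat_idx]
    return out

def hessian_learning_step(H_est, H_local, k, alpha=1.0):
    # Sketch of one round: the client sends only C(H_local - H_est), and
    # both sides fold it into the shared estimate, so neither raw data nor
    # the raw local Hessian leaves the device.
    return H_est + alpha * top_k_matrix(H_local - H_est, k)
```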
no code implementations • NeurIPS 2021 • Mher Safaryan, Filip Hanzely, Peter Richtárik
In order to further alleviate the communication burden inherent in distributed optimization, we propose a novel communication sparsification strategy that can take full advantage of the smoothness matrices associated with local losses.
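As a loose toy illustration of smoothness-aware sparsification (a hypothetical operator, not the strategy proposed in the paper), one could bias the retained coordinates toward directions with larger per-coordinate smoothness constants:

```python
import numpy as np

def smoothness_weighted_sparsify(v, L_diag, k, rng=None):
    # Hypothetical illustration (NOT the paper's operator): sample roughly
    # k coordinates with probabilities proportional to per-coordinate
    # smoothness constants L_diag, rescaling kept entries for unbiasedness.
    if rng is None:
        rng = np.random.default_rng()
    p = np.minimum(1.0, k * L_diag / L_diag.sum())  # inclusion probabilities
    mask = rng.random(len(v)) < p
    out = np.zeros_like(v)
    out[mask] = v[mask] / p[mask]                   # E[out] = v where p > 0
    return out
```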
no code implementations • 7 Oct 2020 • Alyazeed Albasyoni, Mher Safaryan, Laurent Condat, Peter Richtárik
In the average-case analysis, we design a simple compression operator, Spherical Compression, which naturally achieves the lower bound.
no code implementations • 27 Feb 2020 • Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, Mher Safaryan
In the last few years, various communication compression techniques have emerged as an indispensable tool for alleviating the communication bottleneck in distributed learning.
no code implementations • 20 Feb 2020 • Mher Safaryan, Egor Shulgin, Peter Richtárik
In designing a compression method, one aims to communicate as few bits as possible, which minimizes the cost per communication round. At the same time, one attempts to impart as little distortion (variance) to the communicated messages as possible, which minimizes the adverse effect of compression on the overall number of communication rounds.
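This trade-off can be made concrete with a standard unbiased stochastic quantizer in the QSGD style (a generic sketch, not this paper's construction): more levels `s` mean more bits per coordinate but lower variance.

```python
import numpy as np

def stochastic_quantize(v, s, rng=None):
    # Standard unbiased stochastic quantizer: each coordinate is randomly
    # rounded to one of s levels of |v_i| / ||v||. Larger s -> more bits
    # per coordinate, smaller variance E||Q(v) - v||^2, and E[Q(v)] = v.
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0:
        return v.copy()
    level = np.abs(v) / norm * s                  # in [0, s]
    lower = np.floor(level)
    prob = level - lower                          # round up with this probability
    rounded = lower + (rng.random(v.shape) < prob)
    return np.sign(v) * norm * rounded / s
```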
no code implementations • 25 Sep 2019 • Mher Safaryan, Peter Richtárik
Various gradient compression schemes have been proposed to mitigate the communication cost in distributed training of large-scale machine learning models.