no code implementations • 30 May 2023 • Anastasia Koloskova, Nikita Doikov, Sebastian U. Stich, Martin Jaggi
Stochastic Gradient Descent (SGD) algorithms are widely used for optimizing neural networks, with Random Reshuffling (RR) and Single Shuffle (SS) being popular choices: RR cycles through a fresh random permutation of the training data in every epoch, while SS reuses a single permutation throughout training.
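The difference between the two schemes is easiest to see in code. Below is a minimal NumPy sketch (the function names and the toy least-squares objective are illustrative, not taken from the paper): RR redraws the permutation every epoch, SS fixes it once.

```python
import numpy as np

def sgd_with_shuffling(grad_fn, x0, n_samples, n_epochs, lr, scheme="RR", seed=0):
    """Incremental SGD that visits every sample exactly once per epoch.

    scheme="RR": draw a fresh random permutation at the start of every epoch.
    scheme="SS": draw one permutation once and reuse it in every epoch.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    fixed_perm = rng.permutation(n_samples)  # used only by SS
    for _ in range(n_epochs):
        perm = rng.permutation(n_samples) if scheme == "RR" else fixed_perm
        for i in perm:
            x -= lr * grad_fn(x, i)
    return x

# Toy least-squares objective: f(x) = 1/(2n) * sum_i (a_i^T x - b_i)^2.
rng = np.random.default_rng(1)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
grad = lambda x, i: (A[i] @ x - b[i]) * A[i]

x_rr = sgd_with_shuffling(grad, np.zeros(5), 50, 30, 0.01, scheme="RR")
x_ss = sgd_with_shuffling(grad, np.zeros(5), 50, 30, 0.01, scheme="SS")
print(x_rr, x_ss, sep="\n")
```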
no code implementations • 2 May 2023 • Anastasia Koloskova, Hadrien Hendrikx, Sebastian U. Stich
In particular, we show that (i) for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence, and (ii) in the stochastic setting, convergence to the true optimum cannot be guaranteed under the standard noise assumption, even for arbitrarily small step sizes.
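For reference, a minimal sketch of gradient descent with clipped updates (plain NumPy; the threshold name `c` and the toy quadratic are illustrative, not the paper's notation):

```python
import numpy as np

def clip(g, c):
    """Rescale g so its Euclidean norm is at most c (no-op if already smaller)."""
    norm = np.linalg.norm(g)
    return g if norm <= c else (c / norm) * g

def clipped_gd(grad_fn, x0, lr, c, n_steps):
    """Deterministic gradient descent with clipped updates."""
    x = x0.copy()
    for _ in range(n_steps):
        x -= lr * clip(grad_fn(x), c)
    return x

# Toy quadratic: clipping is active far from the optimum and inactive near it.
grad = lambda x: 4.0 * (x - 3.0)
print(clipped_gd(grad, np.array([20.0]), lr=0.1, c=1.0, n_steps=300))  # approx. [3.]
```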
no code implementations • 3 Jan 2023 • Yue Liu, Tao Lin, Anastasia Koloskova, Sebastian U. Stich
Gradient tracking (GT) is an algorithm designed for solving decentralized optimization problems over a network (such as training a machine learning model).
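One common form of the gradient-tracking update is sketched below with illustrative names and a two-node toy problem (a simplification, not the exact algorithm analyzed in the paper): each node mixes its iterate with its neighbors and maintains a tracking variable that estimates the network-wide average gradient.

```python
import numpy as np

def gradient_tracking(grads, W, x0, lr, n_steps):
    """One gradient-tracking variant: nodes mix their iterates with neighbors
    (via W) and maintain tracking variables Y estimating the average gradient.

    grads: list of per-node gradient functions.
    W:     doubly stochastic mixing matrix of the communication graph.
    """
    n = len(grads)
    X = np.tile(x0, (n, 1)).astype(float)             # one row per node
    G = np.array([grads[i](X[i]) for i in range(n)])  # current local gradients
    Y = G.copy()                                      # gradient trackers
    for _ in range(n_steps):
        X = W @ X - lr * Y                  # mix iterates, step along the tracker
        G_new = np.array([grads[i](X[i]) for i in range(n)])
        Y = W @ Y + G_new - G               # mix trackers, add the gradient change
        G = G_new
    return X.mean(axis=0)

# Two nodes with different quadratics; the average objective has its optimum at x = -1.
grads = [lambda x: x - 1.0, lambda x: x + 3.0]
W = np.array([[0.5, 0.5], [0.5, 0.5]])
print(gradient_tracking(grads, W, np.zeros(1), lr=0.1, n_steps=200))  # approx. [-1.]
```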
no code implementations • CVPR 2023 • Bo Li, Mikkel N. Schmidt, Tommy S. Alstrøm, Sebastian U. Stich
In this paper, we first revisit the widely used FedAvg algorithm in a deep neural network to understand how data heterogeneity influences the gradient updates across the neural network layers.
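For reference, a minimal sketch of the FedAvg baseline being analyzed (full client participation, plain local gradient steps; the toy clients are illustrative and not from the paper):

```python
import numpy as np

def fedavg_round(global_x, client_grads, local_steps, lr, weights=None):
    """One FedAvg round: every client runs a few local gradient steps starting
    from the current global model, then the server averages the client models."""
    n = len(client_grads)
    weights = np.ones(n) / n if weights is None else np.asarray(weights)
    client_models = []
    for grad_fn in client_grads:
        x = global_x.copy()
        for _ in range(local_steps):
            x -= lr * grad_fn(x)            # local (stochastic) gradient step
        client_models.append(x)
    return np.average(client_models, axis=0, weights=weights)

# Heterogeneous toy clients with local optima at 2, -4 and 1.
client_grads = [lambda x: x - 2.0, lambda x: x + 4.0, lambda x: x - 1.0]
x = np.zeros(1)
for _ in range(100):
    x = fedavg_round(x, client_grads, local_steps=5, lr=0.1)
print(x)  # converges to the average of the local optima (-1/3) in this equal-curvature toy case
```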
no code implementations • 5 Dec 2022 • Bo Li, Mikkel N. Schmidt, Tommy S. Alstrøm, Sebastian U. Stich
In this paper, we first revisit the widely used FedAvg algorithm in a deep neural network to understand how data heterogeneity influences the gradient updates across the neural network layers.
no code implementations • 16 Jun 2022 • Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi
In this work, (i) we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \sqrt{\tau_{\max}\tau_{\mathrm{avg}}}\,\epsilon^{-1}\right)$ without any change to the algorithm, where $\tau_{\mathrm{avg}}$ is the average delay, which can be significantly smaller than $\tau_{\max}$.
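A minimal sketch of SGD with stale gradients may help fix ideas; the delay schedule below is a toy illustration of a case where $\tau_{\mathrm{avg}}$ is much smaller than $\tau_{\max}$ (names and setup are ours, not the paper's):

```python
import numpy as np

def delayed_sgd(grad_fn, x0, lr, delays, n_steps, sigma=0.1, seed=0):
    """SGD where the gradient applied at step t was computed at the older
    iterate x_{t - tau_t}; delays[t mod len(delays)] plays the role of tau_t."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    history = [x.copy()]
    for t in range(n_steps):
        tau = min(delays[t % len(delays)], t)   # cannot look back before x_0
        stale_x = history[t - tau]
        g = grad_fn(stale_x) + sigma * rng.normal(size=x.shape)
        x = x - lr * g
        history.append(x.copy())
    return x

# Toy quadratic with mostly fresh gradients and an occasional very stale one.
grad = lambda x: x - 1.0
delays = [0, 0, 0, 9]   # average delay 2.25, maximum delay 9
print(delayed_sgd(grad, np.zeros(1), lr=0.05, delays=delays, n_steps=400))  # approx. [1.]
```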
no code implementations • 13 Apr 2022 • Yatin Dandi, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich
Decentralized learning provides an effective framework to train machine learning models with data distributed over arbitrary communication graphs.
no code implementations • 18 Feb 2022 • Harsh Vardhan, Sebastian U. Stich
Non-convex optimization problems are ubiquitous in machine learning, especially in Deep Learning.
no code implementations • NeurIPS 2021 • Anastasia Koloskova, Tao Lin, Sebastian U. Stich
We consider decentralized machine learning over a network where the training data is distributed across $n$ agents, each of which can compute stochastic model updates on their local data.
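A minimal sketch of decentralized SGD of this kind (a local stochastic step followed by gossip averaging with neighbors; the mixing matrix and toy objectives below are illustrative):

```python
import numpy as np

def decentralized_sgd(grads, W, x0, lr, n_steps, sigma=0.1, seed=0):
    """Each agent alternates a local stochastic gradient step with gossip
    averaging of its iterate with its neighbors (mixing matrix W)."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    X = np.tile(x0, (n, 1)).astype(float)
    for _ in range(n_steps):
        G = np.array([grads[i](X[i]) for i in range(n)])
        G += sigma * rng.normal(size=G.shape)   # stochastic gradient noise
        X = W @ (X - lr * G)                    # local step, then gossip averaging
    return X.mean(axis=0)

# Three agents with heterogeneous quadratics; the average objective has optimum x = 1.
grads = [lambda x: x - 1.0, lambda x: x + 2.0, lambda x: x - 4.0]
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
print(decentralized_sgd(grads, W, np.zeros(1), lr=0.05, n_steps=2000))  # approx. [1.]
```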
1 code implementation • 9 Dec 2021 • Yehao Liu, Matteo Pagliardini, Tatjana Chavdarova, Sebastian U. Stich
We further show, on a 2D toy example, that neither BNNs nor MCDropout give high uncertainty estimates on OOD samples.
no code implementations • NeurIPS 2021 • Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian U. Stich, Ananda Theertha Suresh
Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients, which gives rise to the client drift phenomenon.
1 code implementation • 10 Nov 2021 • El Mahdi Chayti, Sai Praneeth Karimireddy, Sebastian U. Stich, Nicolas Flammarion, Martin Jaggi
Collaborative training can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user).
1 code implementation • 11 Oct 2021 • Hui-Po Wang, Sebastian U. Stich, Yang He, Mario Fritz
Federated learning is a powerful distributed learning scheme that allows numerous edge devices to collaboratively train a model without sharing their data.
1 code implementation • NeurIPS 2021 • Thijs Vogels, Lie He, Anastasia Koloskova, Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi
A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions.
no code implementations • 6 Sep 2021 • Sebastian Bischoff, Stephan Günnemann, Martin Jaggi, Sebastian U. Stich
We consider federated learning (FL), where the training data is distributed across a large number of clients.
1 code implementation • ICCV 2021 • Oguz Kaan Yuksel, Sebastian U. Stich, Martin Jaggi, Tatjana Chavdarova
We find that latent adversarial perturbations that adapt to the classifier throughout its training are most effective, yielding the first test-accuracy improvements on real-world datasets (CIFAR-10/100) obtained via latent-space perturbations.
2 code implementations • 14 Jul 2021 • Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Aguera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horvath, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, Satyen Kale, Sai Praneeth Karimireddy, Jakub Konecny, Sanmi Koyejo, Tian Li, Luyang Liu, Mehryar Mohri, Hang Qi, Sashank J. Reddi, Peter Richtarik, Karan Singhal, Virginia Smith, Mahdi Soltanolkotabi, Weikang Song, Ananda Theertha Suresh, Sebastian U. Stich, Ameet Talwalkar, Hongyi Wang, Blake Woodworth, Shanshan Wu, Felix X. Yu, Honglin Yuan, Manzil Zaheer, Mi Zhang, Tong Zhang, Chunxiang Zheng, Chen Zhu, Wennan Zhu
Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection.
no code implementations • 16 Jun 2021 • Amirkeivan Mohtashami, Martin Jaggi, Sebastian U. Stich
State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD).
no code implementations • 3 Mar 2021 • Sebastian U. Stich, Amirkeivan Mohtashami, Martin Jaggi
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the gradient staleness.
1 code implementation • 9 Feb 2021 • Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi
In this paper, we investigate and identify the limitation of several decentralized optimization algorithms for different degrees of data heterogeneity.
no code implementations • 9 Feb 2021 • Lingjing Kong, Tao Lin, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich
Decentralized training of deep learning models enables on-device learning over networks, as well as efficient scaling to large compute clusters.
no code implementations • 3 Nov 2020 • Dmitry Kovalev, Anastasia Koloskova, Martin Jaggi, Peter Richtarik, Sebastian U. Stich
Decentralized optimization methods enable on-device training of machine learning models without a central coordinator.
no code implementations • 4 Sep 2020 • Sebastian U. Stich
Lossy gradient compression, with either unbiased or biased compressors, has become a key tool to avoid the communication bottleneck in centrally coordinated distributed training of machine learning models.
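For concreteness, here is a sketch of the two compressor families mentioned above: a biased top-$k$ sparsifier and an unbiased, rescaled rand-$k$ sparsifier (illustrative implementations, not taken from the paper):

```python
import numpy as np

def top_k(g, k):
    """Biased compressor: keep the k largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def rand_k(g, k, rng):
    """Unbiased compressor: keep k uniformly random coordinates, rescaled by d/k
    so that E[rand_k(g)] = g."""
    d = g.size
    out = np.zeros_like(g)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * g[idx]
    return out

rng = np.random.default_rng(0)
g = rng.normal(size=10)
print(top_k(g, 3))        # deterministic, biased towards large entries
print(rand_k(g, 3, rng))  # random, unbiased in expectation
```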
1 code implementation • 8 Aug 2020 • Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh
Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients, which gives rise to the client drift phenomenon.
no code implementations • 31 Jul 2020 • Ahmad Ajalloeian, Sebastian U. Stich
We analyze the complexity of biased stochastic gradient methods (SGD), where individual updates are corrupted by deterministic, i.e., biased, error terms.
1 code implementation • ICLR 2021 • Tatjana Chavdarova, Matteo Pagliardini, Sebastian U. Stich, Francois Fleuret, Martin Jaggi
Generative Adversarial Networks are notoriously challenging to train.
no code implementations • ICLR 2020 • Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, Martin Jaggi
Deep neural networks often have millions of parameters.
1 code implementation • NeurIPS 2020 • Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi
In most current training schemes, the central model is refined by averaging the parameters of the server model with the updated parameters from the client side.
no code implementations • ICML 2020 • Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi
Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data.
no code implementations • ICML 2020 • Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, Sebastian U. Stich
Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communication-efficiency.
no code implementations • ICML 2020 • Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro
We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method.
8 code implementations • 10 Dec 2019 • Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, Sen Zhao
FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches.
7 code implementations • ICML 2020 • Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh
We obtain tight convergence rates for FedAvg and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow convergence.
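The control-variate idea commonly used to correct client drift can be sketched as follows (in the spirit of SCAFFOLD, but heavily simplified: full participation, exact local gradients, and our own variable names; consult the paper for the actual algorithm):

```python
import numpy as np

def drift_corrected_round(x, c, c_i, client_grads, local_steps, lr):
    """One round of federated averaging with control variates (full client
    participation; client controls updated from the net local progress)."""
    new_models, new_controls = [], []
    for i, grad_fn in enumerate(client_grads):
        y = x.copy()
        for _ in range(local_steps):
            # Corrected local step: subtract the local control c_i, add the global control c.
            y -= lr * (grad_fn(y) - c_i[i] + c)
        ci_new = c_i[i] - c + (x - y) / (local_steps * lr)
        new_models.append(y)
        new_controls.append(ci_new)
    return np.mean(new_models, axis=0), np.mean(new_controls, axis=0), new_controls

# Two heterogeneous quadratic clients with local optima at 3 and -1.
client_grads = [lambda x: x - 3.0, lambda x: x + 1.0]
x, c = np.zeros(1), np.zeros(1)
c_i = [np.zeros(1) for _ in client_grads]
for _ in range(100):
    x, c, c_i = drift_corrected_round(x, c, c_i, client_grads, local_steps=10, lr=0.05)
print(x)  # approaches the optimum of the average objective, x = 1
```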
no code implementations • 11 Sep 2019 • Sebastian U. Stich, Sai Praneeth Karimireddy
We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic convergence rates.
1 code implementation • ICLR 2020 • Anastasia Koloskova, Tao Lin, Sebastian U. Stich, Martin Jaggi
Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters.
no code implementations • 9 Jul 2019 • Sebastian U. Stich
In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on $\mu$-convex functions under a (milder than standard) $L$-smoothness assumption.
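For context, under the standard assumptions of an $L$-smooth, $\mu$-strongly convex $f$ and unbiased stochastic gradients $g_t$ with variance at most $\sigma^2$ (stronger assumptions than the milder smoothness condition used in the note itself), one SGD step $x_{t+1} = x_t - \gamma g_t$ with $\gamma \le 1/L$ satisfies the classical recursion $\mathbb{E}\|x_{t+1} - x^*\|^2 \le (1-\gamma\mu)\,\mathbb{E}\|x_t - x^*\|^2 + \gamma^2\sigma^2$, which unrolls to $\mathbb{E}\|x_T - x^*\|^2 \le (1-\gamma\mu)^T \|x_0 - x^*\|^2 + \gamma\sigma^2/\mu$.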
3 code implementations • 1 Feb 2019 • Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi
We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix.
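A simplified sketch of compressed gossip SGD in the spirit of CHOCO-SGD (with our own simplifications: a top-$k$ compressor, full-matrix notation, and a toy problem; consult the paper for the actual algorithm and step-size conditions):

```python
import numpy as np

def top_k(v, k):
    """A biased sparsifying compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def compressed_gossip_sgd(grads, W, x0, lr, gamma, k, n_steps, sigma=0.1, seed=0):
    """Sketch of gossip SGD with compressed communication: every node keeps a
    public estimate X_hat of its own iterate, transmits only a compressed
    correction to it, and averages over the public copies of its neighbors."""
    rng = np.random.default_rng(seed)
    n, d = len(grads), x0.size
    X = np.tile(x0, (n, 1)).astype(float)   # private iterates
    X_hat = np.zeros((n, d))                # publicly known estimates
    for _ in range(n_steps):
        G = np.array([grads[i](X[i]) for i in range(n)])
        G += sigma * rng.normal(size=G.shape)    # stochastic gradient noise
        X = X - lr * G                           # local SGD step
        Q = np.array([top_k(X[i] - X_hat[i], k) for i in range(n)])
        X_hat = X_hat + Q                        # all nodes update the public copies
        X = X + gamma * (W @ X_hat - X_hat)      # partial averaging of public copies
    return X.mean(axis=0)

grads = [lambda x: x - 2.0, lambda x: x + 2.0, lambda x: x]
W = np.full((3, 3), 1.0 / 3.0)
print(compressed_gossip_sgd(grads, W, np.zeros(2), lr=0.05, gamma=0.2, k=1, n_steps=2000))
# the nodes approach the minimizer of the average objective (the zero vector)
```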
1 code implementation • 28 Jan 2019 • Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi
These issues arise because of the biased nature of the sign compression operator.
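A minimal sketch of the error-feedback fix for such biased compressors (scaled sign compression plus a residual memory; the names and the toy quadratic are illustrative, not the paper's notation):

```python
import numpy as np

def scaled_sign(v):
    """Sign compressor scaled by the mean absolute value (a biased compressor)."""
    return np.abs(v).mean() * np.sign(v)

def ef_sign_sgd(grad_fn, x0, lr, n_steps, sigma=0.1, seed=0):
    """SGD with sign compression plus an error-feedback memory: whatever part of
    the intended update the compressor discards is stored and re-added next step."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    e = np.zeros_like(x)                    # error memory
    for _ in range(n_steps):
        g = grad_fn(x) + sigma * rng.normal(size=x.shape)
        p = lr * g + e                      # intended update plus accumulated error
        delta = scaled_sign(p)              # what is actually transmitted/applied
        e = p - delta                       # remember what was lost
        x = x - delta
    return x

grad = lambda x: x - np.array([1.0, -2.0, 0.5])
print(ef_sign_sgd(grad, np.zeros(3), lr=0.05, n_steps=3000))  # near [1, -2, 0.5], up to noise
```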
no code implementations • 16 Oct 2018 • Sai Praneeth Karimireddy, Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi
For these problems we provide the first linear rates of convergence independent of $n$, and show that our greedy update rule provides speedups similar to those obtained in the smooth case.
1 code implementation • NeurIPS 2018 • Sebastian U. Stich, Jean-Baptiste Cordonnier, Martin Jaggi
Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training.
2 code implementations • ICLR 2020 • Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks.
no code implementations • 1 Jun 2018 • Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi
We show that Newton's method converges globally at a linear rate for objective functions whose Hessians are stable.
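For reference, a plain (undamped) Newton iteration on a toy smooth, strictly convex objective (illustrative code only, not the globalization argument of the paper):

```python
import numpy as np

def newton(grad_fn, hess_fn, x0, n_steps=20, tol=1e-10):
    """Plain (undamped) Newton iteration: x <- x - H(x)^{-1} grad(x)."""
    x = x0.copy()
    for _ in range(n_steps):
        g = grad_fn(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess_fn(x), g)
    return x

# Smooth, strictly convex toy objective: f(x) = sum_i log(1 + exp(a_i^T x)) + 0.5 * ||x||^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
grad = lambda x: A.T @ sigmoid(A @ x) + x
hess = lambda x: (A.T * (sigmoid(A @ x) * (1.0 - sigmoid(A @ x)))) @ A + np.eye(3)
print(newton(grad, hess, np.zeros(3)))
```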
1 code implementation • ICLR 2019 • Sebastian U. Stich
Local SGD can also be used for large-scale training of deep learning models.
no code implementations • 2 May 2018 • Anant Raj, Sebastian U. Stich
Variance-reduced stochastic gradient (SGD) methods converge significantly faster than their vanilla SGD counterparts.
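A minimal SVRG-style sketch of the variance-reduction idea (one standard scheme, shown on a toy least-squares problem; not the specific method proposed in the paper):

```python
import numpy as np

def svrg(grad_i, n, x0, lr, n_epochs, inner_steps, seed=0):
    """SVRG-style variance reduction: periodically compute a full gradient at a
    snapshot point and use it as a control variate for each stochastic gradient."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_epochs):
        snapshot = x.copy()
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Unbiased gradient estimate whose variance vanishes near a solution.
            g = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= lr * g
    return x

# Toy least squares: f(x) = 1/(2n) * sum_i (a_i^T x - b_i)^2.
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 5))
b = A @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
print(svrg(grad_i, 100, np.zeros(5), lr=0.02, n_epochs=30, inner_steps=100))
print(np.linalg.lstsq(A, b, rcond=None)[0])  # reference solution
```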
no code implementations • ICML 2018 • Francesco Locatello, Anant Raj, Sai Praneeth Karimireddy, Gunnar Rätsch, Bernhard Schölkopf, Sebastian U. Stich, Martin Jaggi
Exploiting the connection between the two algorithms, we present a unified analysis of both, providing affine invariant sublinear $\mathcal{O}(1/t)$ rates on smooth objectives and linear convergence on strongly convex objectives.
no code implementations • NeurIPS 2017 • Sebastian U. Stich, Anant Raj, Martin Jaggi
Importance sampling has become an indispensable strategy to speed up optimization algorithms for large-scale applications.
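A minimal sketch of importance sampling for SGD, where examples are drawn proportionally to their smoothness constants and the gradients are reweighted to stay unbiased (toy least-squares setup; this sampling distribution is one standard choice, not necessarily the one proposed in the paper):

```python
import numpy as np

def importance_sampling_sgd(A, b, x0, lr, n_steps, seed=0):
    """SGD for least squares where example i is sampled with probability
    proportional to its smoothness constant L_i = ||a_i||^2, and the gradient is
    reweighted so that the update remains unbiased."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    L = np.sum(A * A, axis=1)        # per-example smoothness constants
    p = L / L.sum()                  # importance sampling distribution
    x = x0.copy()
    for _ in range(n_steps):
        i = rng.choice(n, p=p)
        g = (A[i] @ x - b[i]) * A[i]
        x -= lr * g / (n * p[i])     # reweighting keeps E[update] = full gradient step
    return x

# Toy least squares with rows of very different norms (so uniform sampling is wasteful).
rng = np.random.default_rng(2)
A = rng.normal(size=(200, 5)) * rng.uniform(0.1, 3.0, size=(200, 1))
b = A @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)
print(importance_sampling_sgd(A, b, np.zeros(5), lr=0.01, n_steps=10000))
```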
no code implementations • ICML 2017 • Sebastian U. Stich, Anant Raj, Martin Jaggi
We propose a new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization.
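One classical greedy selection rule, Gauss-Southwell(-Lipschitz), is sketched below on a toy quadratic (illustrative only; not necessarily the rule proposed in the paper):

```python
import numpy as np

def greedy_coordinate_descent(grad_fn, lipschitz, x0, n_steps):
    """Coordinate descent with the Gauss-Southwell(-Lipschitz) selection rule:
    at every step, update the coordinate with the largest (scaled) partial
    derivative instead of a uniformly random one."""
    x = x0.copy()
    for _ in range(n_steps):
        g = grad_fn(x)
        j = np.argmax(np.abs(g) / np.sqrt(lipschitz))   # greedy coordinate selection
        x[j] -= g[j] / lipschitz[j]                     # coordinate-wise step
    return x

# Strongly convex quadratic: f(x) = 0.5 * x^T Q x - c^T x, with coordinate-wise L_j = Q_jj.
rng = np.random.default_rng(3)
M = rng.normal(size=(6, 6))
Q = M @ M.T + np.eye(6)
c = rng.normal(size=6)
grad = lambda x: Q @ x - c
print(greedy_coordinate_descent(grad, np.diag(Q).copy(), np.zeros(6), n_steps=500))
print(np.linalg.solve(Q, c))  # reference solution
```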