no code implementations • 4 Feb 2025 • Zhiyuan Wu, Changkyu Choi, Xiangcheng Cao, Volkan Cevher, Ali Ramezani-Kebrya
In multi-node learning environments, VRLS extends this approach by learning and adapting density ratios across nodes, mitigating label shift and improving overall model performance.
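As a rough illustration of the density-ratio idea behind this line of work, the sketch below reweights a cross-entropy loss by estimated label density ratios p_target(y)/p_source(y); the ratio values, class count, and function names are illustrative assumptions, not the VRLS estimator itself.

```python
import numpy as np

# Hypothetical sketch: reweighting a cross-entropy loss with estimated label
# density ratios r(y) = p_target(y) / p_source(y). The ratios below are
# made-up numbers, not outputs of the VRLS estimator.

def weighted_cross_entropy(logits, labels, density_ratios):
    """Per-example cross-entropy scaled by each label's density ratio."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    return (density_ratios[labels] * nll).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 3))        # 8 examples, 3 classes
labels = rng.integers(0, 3, size=8)
ratios = np.array([0.5, 1.0, 2.0])      # illustrative p_target(y)/p_source(y)
print(weighted_cross_entropy(logits, labels, ratios))
```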
1 code implementation • 17 Aug 2023 • Ali Ramezani-Kebrya, Kimon Antonakopoulos, Igor Krawczuk, Justin Deschenaux, Volkan Cevher
We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local stochastic dual vectors.
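For readers unfamiliar with the setting, the sketch below runs a generic stochastic extragradient iteration on a toy bilinear saddle problem, averaging a few noisy operator evaluations per step to mimic multiple workers; the operator, step size, and worker count are placeholders, and this is not the algorithm proposed in the paper.

```python
import numpy as np

# Generic stochastic extragradient step for a monotone VI on a toy bilinear
# saddle problem min_x max_y x^T A y; each step averages a few noisy operator
# evaluations ("stochastic dual vectors") as a stand-in for multiple workers.

rng = np.random.default_rng(0)
d = 4
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))    # well-conditioned toy operator

def F(z, noise=0.1):
    """Stochastic dual vector: the monotone operator plus Gaussian noise."""
    x, y = z[:d], z[d:]
    g = np.concatenate([A @ y, -A.T @ x])
    return g + noise * rng.normal(size=2 * d)

z, step, workers = rng.normal(size=2 * d), 0.1, 4
for _ in range(2000):
    g = np.mean([F(z) for _ in range(workers)], axis=0)        # average workers
    z_half = z - step * g                                      # extrapolation
    g_half = np.mean([F(z_half) for _ in range(workers)], axis=0)
    z = z - step * g_half                                      # update
print(np.linalg.norm(z))   # far below the initial norm; the solution is z = 0
```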
no code implementations • 8 Jun 2023 • Ali Ramezani-Kebrya, Fanghui Liu, Thomas Pethick, Grigorios Chrysos, Volkan Cevher
This paper addresses intra-client and inter-client covariate shifts in federated learning (FL) with a focus on the overall generalization performance.
no code implementations • 16 Jul 2022 • Ali Ramezani-Kebrya, Iman Tabrizian, Fartash Faghri, Petar Popovski
We introduce MixTailor, a scheme based on randomization of the aggregation strategies that makes it impossible for the attacker to be fully informed.
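A minimal sketch of the randomized-aggregation idea, assuming a small pool of standard rules (mean, coordinate-wise median, trimmed mean) sampled uniformly each round; the pool and the sampling distribution are illustrative choices, not the exact MixTailor scheme.

```python
import numpy as np

# Illustrative randomized aggregation in the spirit of MixTailor: each round
# the server samples one rule from a pool, so an attacker cannot tailor
# gradients to a single, known aggregator. Pool and sampling are assumptions.

def mean_agg(grads):
    return grads.mean(axis=0)

def coordinate_median_agg(grads):
    return np.median(grads, axis=0)

def trimmed_mean_agg(grads, trim=1):
    s = np.sort(grads, axis=0)                  # sort each coordinate
    return s[trim:len(grads) - trim].mean(axis=0)

AGGREGATORS = [mean_agg, coordinate_median_agg, trimmed_mean_agg]

def randomized_aggregate(grads, rng):
    """Pick an aggregation rule uniformly at random for this round."""
    rule = AGGREGATORS[rng.integers(len(AGGREGATORS))]
    return rule(grads)

rng = np.random.default_rng(0)
worker_grads = rng.normal(size=(10, 5))         # 10 workers, 5-dim gradients
print(randomized_aggregate(worker_grads, rng))
```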
no code implementations • NeurIPS 2021 • ChaeHwan Song, Ali Ramezani-Kebrya, Thomas Pethick, Armin Eftekhari, Volkan Cevher
Overparameterization refers to the important phenomenon where the width of a neural network is chosen such that learning algorithms can provably attain zero loss in nonconvex training.
no code implementations • 29 Sep 2021 • Paul Rolland, Ali Ramezani-Kebrya, ChaeHwan Song, Fabian Latorre, Volkan Cevher
Despite the non-convex landscape, first-order methods can be shown to reach global minima when training overparameterized neural networks, where the number of parameters far exceeds the number of training data points.
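A back-of-the-envelope count for a two-layer network makes the "parameters far exceed data" condition concrete; the width, input dimension, and dataset size below are arbitrary.

```python
# Back-of-the-envelope illustration of overparameterization: a two-layer
# network of width m on d-dimensional inputs has m*d + m + m + 1 parameters.
# The numbers below are arbitrary.
n, d, m = 1000, 32, 4096              # training points, input dim, hidden width
num_params = m * d + m + m + 1        # W1 (m x d), b1 (m), w2 (m), b2 (1)
print(num_params, ">", n, "->", num_params > n)   # 139265 > 1000 -> True
```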
no code implementations • 28 Apr 2021 • Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy
As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training.
no code implementations • 26 Feb 2021 • Ali Ramezani-Kebrya, Ashish Khisti, Ben Liang
While momentum-based methods, in conjunction with stochastic gradient descent (SGD), are widely used when training machine learning models, there is little theoretical understanding of the generalization error of such methods.
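To fix the update rule in question, the sketch below runs generic heavy-ball momentum SGD on a toy least-squares problem; the data, step size, and momentum coefficient are placeholders, and this is an illustration of the method class rather than the paper's analysis or setting.

```python
import numpy as np

# Generic heavy-ball (momentum) SGD on a toy least-squares objective:
#   v_{t+1} = mu * v_t - lr * g_t,   w_{t+1} = w_t + v_{t+1}.
# Data, step size, and momentum coefficient are placeholders.
rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(200, 10)), rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=200)

w, v, lr, mu = np.zeros(10), np.zeros(10), 0.01, 0.9
for t in range(500):
    i = rng.integers(len(y))               # single-sample stochastic gradient
    g = (X[i] @ w - y[i]) * X[i]
    v = mu * v - lr * g
    w = w + v
print(np.linalg.norm(w - w_true))          # should be well below norm(w_true)
```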
1 code implementation • NeurIPS 2020 • Fartash Faghri, Iman Tabrizian, Ilia Markov, Dan Alistarh, Daniel Roy, Ali Ramezani-Kebrya
Many communication-efficient variants of SGD use gradient quantization schemes.
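As a concrete example of such a scheme, the sketch below implements a simple unbiased stochastic quantizer with uniform levels (in the QSGD family); it is not the nonuniform quantizer (NUQSGD) studied here, and the level count is an arbitrary choice.

```python
import numpy as np

# Simple unbiased stochastic uniform quantizer: magnitudes are mapped to
# [0, num_levels] and rounded randomly so that E[quantize(g)] = g. Workers
# would transmit the norm, signs, and integer levels instead of full floats.

def quantize(g, num_levels=4, seed=0):
    rng = np.random.default_rng(seed)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return g
    scaled = np.abs(g) / norm * num_levels                  # in [0, num_levels]
    lower = np.floor(scaled)
    level = lower + (rng.random(g.shape) < scaled - lower)  # randomized rounding
    return np.sign(g) * norm * level / num_levels

g = np.array([0.3, -1.2, 0.05, 0.7])
print(quantize(g))
```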
no code implementations • 25 Sep 2019 • Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy
As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed on clusters to perform model fitting in parallel.
1 code implementation • 16 Aug 2019 • Ali Ramezani-Kebrya, Fartash Faghri, Daniel M. Roy
As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed on clusters to perform model fitting in parallel.
no code implementations • ICLR 2019 • Ali Ramezani-Kebrya, Ashish Khisti, Ben Liang
While momentum-based methods, in conjunction with stochastic gradient descent, are widely used when training machine learning models, there is little theoretical understanding of the generalization error of such methods.
no code implementations • 12 Sep 2018 • Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, Ben Liang
While momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models, there is little theoretical understanding of the generalization error of such methods.