no code implementations • 18 Feb 2024 • Liam Collins, Advait Parulekar, Aryan Mokhtari, Sujay Sanghavi, Sanjay Shakkottai
We show that an attention unit learns a window that it uses to implement a nearest-neighbors predictor adapted to the landscape of the pretraining tasks.
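As a rough illustration of the intuition (all names here are hypothetical, not taken from the paper), softmax attention over context points behaves like a soft nearest-neighbors predictor whose window is set by a scale parameter:

```python
import numpy as np

def soft_nearest_neighbor(x_query, X_ctx, y_ctx, width):
    """Softmax attention over context examples: a soft nearest-neighbors
    predictor whose window is controlled by `width`."""
    # Negative squared distances act as attention scores.
    scores = -np.sum((X_ctx - x_query) ** 2, axis=1) / width
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ y_ctx  # weighted average of context labels

# As width -> 0 this approaches a 1-nearest-neighbor predictor.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
print(soft_nearest_neighbor(X[0], X, y, width=0.1))
```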
no code implementations • 12 Feb 2024 • Jincheng Cao, Ruichen Jiang, Erfan Yazdandoost Hamedani, Aryan Mokhtari
In this paper, we focus on simple bilevel optimization problems, where we minimize a convex smooth objective function over the optimal solution set of another convex smooth constrained optimization problem.
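In symbols, the problem class described here is

$$\min_{x \in \mathbb{R}^d} \; f(x) \quad \text{s.t.} \quad x \in \operatorname*{arg\,min}_{z \in \mathcal{Z}} \; g(z),$$

where the upper-level objective $f$ and the lower-level objective $g$ are both convex and smooth, and $\mathcal{Z}$ is the lower-level constraint set.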
no code implementations • 5 Jan 2024 • Ruichen Jiang, Parameswaran Raman, Shoham Sabach, Aryan Mokhtari, Mingyi Hong, Volkan Cevher
In this paper, we introduce a novel subspace cubic regularized Newton method that achieves a dimension-independent global convergence rate of ${O}\left(\frac{1}{mk}+\frac{1}{k^2}\right)$ for solving convex optimization problems.
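A generic subspace cubic-Newton step, sketched under the assumption that $V_k \in \mathbb{R}^{d \times m}$ is an orthonormal basis of the chosen subspace (notation ours, not necessarily the paper's), restricts the cubic-regularized model to that subspace:

$$x_{k+1} = x_k + V_k z_k, \qquad z_k = \operatorname*{arg\,min}_{z \in \mathbb{R}^m} \left\{ \langle V_k^\top \nabla f(x_k), z \rangle + \tfrac{1}{2} z^\top V_k^\top \nabla^2 f(x_k) V_k z + \tfrac{M}{6} \|z\|^3 \right\},$$

so each iteration only requires an $m \times m$ projected Hessian rather than the full $d \times d$ one.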
no code implementations • 13 Jul 2023 • Liam Collins, Hamed Hassani, Mahdi Soltanolkotabi, Aryan Mokhtari, Sanjay Shakkottai
An increasingly popular machine learning paradigm is to pretrain a neural network (NN) on many tasks offline, then adapt it to downstream tasks, often by re-training only the last linear layer of the network.
no code implementations • 27 Jun 2023 • Zhan Gao, Aryan Mokhtari, Alec Koppel
Interestingly, our established non-asymptotic superlinear convergence rate demonstrates an explicit trade-off between convergence speed and memory requirement, which, to our knowledge, is the first result of its kind.
no code implementations • 16 Feb 2023 • Ruichen Jiang, Qiujiang Jin, Aryan Mokhtari
Quasi-Newton algorithms are among the most popular iterative methods for solving unconstrained minimization problems, largely due to their favorable superlinear convergence property.
no code implementations • 15 Feb 2023 • Advait Parulekar, Liam Collins, Karthikeyan Shanmugam, Aryan Mokhtari, Sanjay Shakkottai
The goal of contrastive learning is to learn a representation that preserves underlying clusters by keeping samples with similar content, e.g., the "dogness" of a dog, close to each other in the space generated by the representation.
no code implementations • 11 Jan 2023 • Parikshit Hegde, Gustavo de Veciana, Aryan Mokhtari
In order to achieve the dual goals of privacy and learning across distributed data, Federated Learning (FL) systems rely on frequent exchanges of large files (model updates) between a set of clients and the server.
no code implementations • 2 Sep 2022 • Mao Ye, Ruichen Jiang, Haoxiang Wang, Dhruv Choudhary, Xiaocong Du, Bhargav Bhushanam, Aryan Mokhtari, Arun Kejariwal, Qiang Liu
One of the key challenges of learning an online recommendation model is the temporal domain shift, which causes a mismatch between the training and testing data distributions and hence leads to domain generalization error.
1 code implementation • 17 Jun 2022 • Ruichen Jiang, Nazanin Abolfazli, Aryan Mokhtari, Erfan Yazdandoost Hamedani
To the best of our knowledge, our method achieves the best-known iteration complexity for the considered class of bilevel problems.
1 code implementation • 5 Jun 2022 • Isidoros Tziotis, Zebang Shen, Ramtin Pedarsani, Hamed Hassani, Aryan Mokhtari
Federated Learning is an emerging learning paradigm that allows training models from samples distributed across a large network of clients while respecting privacy and communication restrictions.
no code implementations • 27 May 2022 • Liam Collins, Hamed Hassani, Aryan Mokhtari, Sanjay Shakkottai
We show that the reason behind the generalizability of FedAvg's output is its ability to learn the common data representation among the clients' tasks by leveraging the diversity among client data distributions via local updates.
no code implementations • 19 Feb 2022 • Ruichen Jiang, Aryan Mokhtari
In this paper, we follow this approach and distill the underlying idea of optimism to propose a generalized optimistic method, which includes the optimistic gradient method as a special case.
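For reference, writing $F$ for the gradient operator of the underlying minimax problem, the optimistic gradient special case takes the familiar form

$$z_{k+1} = z_k - \eta \left( 2F(z_k) - F(z_{k-1}) \right),$$

where the extrapolation term $F(z_k) - F(z_{k-1})$ serves as the "optimistic" prediction of the next gradient.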
no code implementations • 11 Feb 2022 • Matthew Faw, Isidoros Tziotis, Constantine Caramanis, Aryan Mokhtari, Sanjay Shakkottai, Rachel Ward
We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives.
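A minimal sketch of AdaGrad-Norm, assuming a generic stochastic gradient oracle `grad` (the constants are illustrative, not those from the analysis):

```python
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, b0=1e-2, steps=1000):
    """AdaGrad-Norm: a single scalar step size adapted from the
    accumulated squared norms of observed (stochastic) gradients."""
    x, b2 = np.asarray(x0, dtype=float), b0 ** 2
    for _ in range(steps):
        g = grad(x)
        b2 += np.dot(g, g)            # accumulate squared gradient norm
        x -= (eta / np.sqrt(b2)) * g  # step size eta / b_k shrinks adaptively
    return x

# Example: minimize the smooth quadratic f(x) = ||x||^2.
print(adagrad_norm(lambda x: 2 * x, x0=np.ones(5)))
```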
no code implementations • 7 Feb 2022 • Liam Collins, Aryan Mokhtari, Sewoong Oh, Sanjay Shakkottai
Recent empirical evidence has driven conventional wisdom to believe that gradient-based meta-learning (GBML) methods perform well at few-shot learning because they learn an expressive data representation that is shared across tasks.
no code implementations • 1 Nov 2021 • Arman Adibi, Aryan Mokhtari, Hamed Hassani
Prior literature has thus far mainly focused on studying such problems in the continuous domain, e.g., convex-concave minimax optimization is now understood to a significant extent.
no code implementations • NeurIPS 2021 • Qiujiang Jin, Aryan Mokhtari
In this paper, we use an adaptive sample size scheme that exploits the superlinear convergence of quasi-Newton methods globally and throughout the entire learning process.
3 code implementations • 14 Feb 2021 • Liam Collins, Hamed Hassani, Aryan Mokhtari, Sanjay Shakkottai
Based on this intuition, we propose a novel federated learning framework and algorithm for learning a shared data representation across clients and unique local heads for each client.
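A simplified linear-regression sketch of one such round (helper names hypothetical; this is an illustration of the alternating scheme, not the released implementation): each client fits its local head with the shared representation frozen, takes a gradient step on the representation, and the server averages the representation updates.

```python
import numpy as np

def fedrep_round(B, heads, client_data, lr=0.1, head_steps=5):
    """One round of a FedRep-style scheme with linear models:
    B is the shared representation, heads[i] the local head of client i."""
    B_updates = []
    for i, (X, y) in enumerate(client_data):
        Z = X @ B                                   # shared features
        for _ in range(head_steps):                 # local head update
            r = Z @ heads[i] - y
            heads[i] -= lr * Z.T @ r / len(y)
        r = Z @ heads[i] - y                        # representation gradient step
        B_updates.append(B - lr * X.T @ np.outer(r, heads[i]) / len(y))
    return np.mean(B_updates, axis=0), heads
```

Only the representation `B` is communicated and averaged; each head stays local, which is what personalizes the model to each client.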
no code implementations • NeurIPS 2021 • Alireza Fallah, Aryan Mokhtari, Asuman Ozdaglar
In this paper, we study the generalization properties of Model-Agnostic Meta-Learning (MAML) algorithms for supervised learning problems.
no code implementations • 28 Dec 2020 • Amirhossein Reisizadeh, Isidoros Tziotis, Hamed Hassani, Aryan Mokhtari, Ramtin Pedarsani
Federated Learning is a novel paradigm that involves learning from data samples distributed across a large network of clients while the data remains local.
2 code implementations • NeurIPS 2020 • Alireza Fallah, Aryan Mokhtari, Asuman Ozdaglar
In this paper, we study a personalized variant of federated learning, in which our goal is to find an initial shared model that current or new users can easily adapt to their local dataset by performing one or a few steps of gradient descent with respect to their own data.
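The resulting objective is the MAML-style formulation

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} f_i\big(w - \alpha \nabla f_i(w)\big),$$

where $f_i$ is user $i$'s loss and $\alpha$ is the local adaptation step size, so the shared model is judged by its performance after each user's one-step personalization.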
no code implementations • NeurIPS 2020 • Isidoros Tziotis, Constantine Caramanis, Aryan Mokhtari
In this paper we study the problem of escaping from saddle points and achieving second-order optimality in a decentralized setting where a group of agents collaborate to minimize their aggregate objective function.
no code implementations • 27 Oct 2020 • Liam Collins, Aryan Mokhtari, Sanjay Shakkottai
Model-Agnostic Meta-Learning (MAML) has become increasingly popular for training models that can quickly adapt to new tasks via one or few stochastic gradient descent steps.
1 code implementation • NeurIPS 2020 • Arman Adibi, Aryan Mokhtari, Hamed Hassani
Motivated by this terminology, we propose a novel meta-learning framework in the discrete domain where each task is equivalent to maximizing a set function under a cardinality constraint.
1 code implementation • 2 Jul 2020 • Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari, Mehrdad Mahdavi
In federated learning, communication cost is often a critical bottleneck to scale up distributed optimization algorithms to collaboratively learn a model from millions of devices with potentially unreliable or limited communication and heterogeneous data distributions.
no code implementations • 23 Jun 2020 • Mohammad Fereydounian, Zebang Shen, Aryan Mokhtari, Amin Karbasi, Hamed Hassani
More precisely, by assuming that Reliable-FW has access to a (stochastic) gradient oracle of the objective function and a noisy feasibility oracle of the safety polytope, it finds an $\epsilon$-approximate first-order stationary point with the optimal $\mathcal{O}(1/\epsilon^2)$ gradient oracle complexity.
no code implementations • 7 Jun 2020 • Aryan Mokhtari, Leyla Sadighi, Behnam Bahrak, Mojtaba Eshghie
In this paper, a new hybrid method is proposed that combines several anomaly detection techniques, namely GARCH, K-means, and neural networks, to detect anomalous data.
no code implementations • 30 Mar 2020 • Qiujiang Jin, Aryan Mokhtari
In this paper, we provide a finite-time (non-asymptotic) convergence analysis for Broyden quasi-Newton algorithms under the assumptions that the objective function is strongly convex, its gradient is Lipschitz continuous, and its Hessian is Lipschitz continuous at the optimal solution.
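For concreteness, a minimal numpy sketch of the BFGS member of the Broyden class (variable names ours):

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation B, given the iterate
    displacement s = x_new - x_old and the gradient displacement
    y = grad(x_new) - grad(x_old). It preserves symmetric positive
    definiteness whenever the curvature condition s @ y > 0 holds,
    which is guaranteed for strongly convex objectives."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# The quasi-Newton step is then x_new = x - np.linalg.solve(B, grad(x)).
```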
no code implementations • ICML 2020 • Hossein Taheri, Aryan Mokhtari, Hamed Hassani, Ramtin Pedarsani
We consider a decentralized stochastic learning problem where data points are distributed among computing nodes communicating over a directed graph.
1 code implementation • NeurIPS 2021 • Alireza Fallah, Kristian Georgiev, Aryan Mokhtari, Asuman Ozdaglar
We consider Model-Agnostic Meta-Learning (MAML) methods for Reinforcement Learning (RL) problems, where the goal is to find a policy using data from several tasks represented by Markov Decision Processes (MDPs) that can be updated by one step of stochastic policy gradient for the realized MDP.
no code implementations • NeurIPS 2020 • Liam Collins, Aryan Mokhtari, Sanjay Shakkottai
Meta-learning methods have shown an impressive ability to train models that rapidly learn new tasks.
no code implementations • NeurIPS 2019 • Amin Karbasi, Hamed Hassani, Aryan Mokhtari, Zebang Shen
Concretely, for a monotone and continuous DR-submodular function, SCG++ achieves a tight $[(1-1/e)\text{OPT}-\epsilon]$ solution while using $O(1/\epsilon^2)$ stochastic gradients and $O(1/\epsilon)$ calls to the linear optimization oracle.
no code implementations • 31 Oct 2019 • Weijie Liu, Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil, Zebang Shen, Nenggan Zheng
In this paper, we focus on solving a class of constrained non-convex non-concave saddle point problems in a decentralized manner by a group of nodes in a network.
no code implementations • 10 Oct 2019 • Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, Amin Karbasi
One of the beauties of the projected gradient descent method lies in its rather simple mechanism and yet stable behavior with inexact, stochastic gradients, which has led to its widespread use in many machine learning applications.
no code implementations • 28 Sep 2019 • Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, Ramtin Pedarsani
Federated learning is a distributed framework according to which a model is trained over a set of devices, while keeping data localized.
no code implementations • 27 Aug 2019 • Alireza Fallah, Aryan Mokhtari, Asuman Ozdaglar
We study the convergence of a class of gradient-based Model-Agnostic Meta-Learning (MAML) methods and characterize their overall complexity as well as their best achievable accuracy in terms of gradient norm for nonconvex loss functions.
1 code implementation • NeurIPS 2019 • Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, Ramtin Pedarsani
We consider a decentralized learning problem, where a set of computing nodes aim at solving a non-convex optimization problem collaboratively.
no code implementations • 3 Jun 2019 • Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil
To do so, we first show that both OGDA and EG can be interpreted as approximate variants of the proximal point method.
no code implementations • 19 Feb 2019 • Hamed Hassani, Amin Karbasi, Aryan Mokhtari, Zebang Shen
It is known that this rate is optimal in terms of stochastic gradient evaluations.
no code implementations • 17 Feb 2019 • Mingrui Zhang, Lin Chen, Aryan Mokhtari, Hamed Hassani, Amin Karbasi
How can we efficiently mitigate the overhead of gradient communications in distributed optimization?
no code implementations • 24 Jan 2019 • Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil
In this paper we consider solving saddle point problems using two variants of Gradient Descent-Ascent algorithms, Extra-gradient (EG) and Optimistic Gradient Descent Ascent (OGDA) methods.
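A small numpy sketch of EG on a bilinear toy problem (step size and iteration count illustrative): EG evaluates gradients at an extrapolated midpoint, which is what stabilizes it where plain gradient descent-ascent cycles.

```python
import numpy as np

def extragradient(gx, gy, x, y, eta=0.3, steps=300):
    """Extra-gradient (EG) for min_x max_y f(x, y): step to a midpoint,
    then update using the gradients evaluated at that midpoint."""
    for _ in range(steps):
        x_mid, y_mid = x - eta * gx(x, y), y + eta * gy(x, y)
        x, y = x - eta * gx(x_mid, y_mid), y + eta * gy(x_mid, y_mid)
    return x, y

# Bilinear saddle f(x, y) = x * y: plain descent-ascent spirals outward,
# while EG converges to the saddle point (0, 0).
gx = lambda x, y: y  # partial derivative of f in x
gy = lambda x, y: x  # partial derivative of f in y
print(extragradient(gx, gy, 1.0, 1.0))
```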
no code implementations • 26 Oct 2018 • Majid Jahani, Xi He, Chenxin Ma, Aryan Mokhtari, Dheevatsa Mudigere, Alejandro Ribeiro, Martin Takáč
In this paper, we propose a Distributed Accumulated Newton Conjugate gradiEnt (DANCE) method in which the sample size gradually increases to quickly obtain a solution whose empirical loss is under satisfactory statistical accuracy.
no code implementations • NeurIPS 2018 • Aryan Mokhtari, Asuman Ozdaglar, Ali Jadbabaie
We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $\mathcal{C}$ is simple for a quadratic objective function.
no code implementations • 29 Jun 2018 • Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ramtin Pedarsani
We consider the problem of decentralized consensus optimization, where the sum of $n$ smooth and strongly convex functions is minimized over $n$ distributed agents that form a connected network.
no code implementations • ICML 2018 • Zebang Shen, Aryan Mokhtari, Tengfei Zhou, Peilin Zhao, Hui Qian
Recently, the decentralized optimization problem has been attracting growing attention.
no code implementations • NeurIPS 2018 • Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, Ali Jadbabaie
We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method.
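The continuous limit in question is commonly written (following Su, Boyd, and Candès) as

$$\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\big(X(t)\big) = 0,$$

and the question is when a direct discretization of such second-order dynamics retains acceleration.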
no code implementations • 24 Apr 2018 • Aryan Mokhtari, Hamed Hassani, Amin Karbasi
Further, for a monotone and continuous DR-submodular function and subject to a general convex body constraint, we prove that our proposed method achieves a $((1-1/e)\text{OPT}-\epsilon)$ guarantee with $O(1/\epsilon^3)$ stochastic gradient computations.
no code implementations • 5 Nov 2017 • Aryan Mokhtari, Hamed Hassani, Amin Karbasi
More precisely, for a monotone and continuous DR-submodular function and subject to a general convex body constraint, we prove that the proposed stochastic continuous greedy (SCG) method achieves a $[(1-1/e)\text{OPT}-\epsilon]$ guarantee (in expectation) with $\mathcal{O}(1/\epsilon^3)$ stochastic gradient computations.
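A compact sketch of a stochastic continuous greedy loop of this type (the averaging weight is one common choice from this line of work; `stoch_grad` and `lmo` are assumed oracles):

```python
import numpy as np

def stochastic_continuous_greedy(stoch_grad, lmo, d, T=100):
    """Average stochastic gradients to damp noise, then move along
    linear-optimization (conditional-gradient) directions."""
    x, g = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        rho = 4.0 / (t + 8) ** (2.0 / 3)   # decaying averaging weight
        g = (1 - rho) * g + rho * stoch_grad(x)
        v = lmo(g)                          # argmax_{v in C} <v, g>
        x += v / T                          # after T steps x is feasible
    return x

# Example LMO for the box constraint C = [0, 1]^d:
# lmo = lambda g: (g > 0).astype(float)
```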
no code implementations • NeurIPS 2017 • Aryan Mokhtari, Alejandro Ribeiro
Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods.
no code implementations • 22 May 2017 • Mark Eisen, Aryan Mokhtari, Alejandro Ribeiro
In this paper, we propose a novel adaptive sample size second-order method, which reduces the cost of computing the Hessian by solving a sequence of ERM problems corresponding to a subset of samples and lowers the cost of computing the Hessian inverse using a truncated eigenvalue decomposition.
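Schematically, with `solve_erm` a hypothetical solver callback, an adaptive sample size scheme grows the training set geometrically and warm-starts each stage:

```python
def adaptive_sample_size(solve_erm, n, m0=128, growth=2.0):
    """Solve ERM on a small subsample to within its statistical accuracy,
    then warm-start on a geometrically larger subsample, up to n samples."""
    w, m = None, float(m0)
    while True:
        m_cur = min(int(m), n)
        w = solve_erm(m_cur, warm_start=w)  # e.g., truncated Newton steps
        if m_cur == n:
            return w
        m *= growth
```

The idea driving this line of work is that each stage's solution already lies within the statistical accuracy of the next, larger stage, so only a few (second-order) iterations are needed per stage.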
no code implementations • 2 Feb 2017 • Aryan Mokhtari, Mark Eisen, Alejandro Ribeiro
This makes their computational cost per iteration independent of the number of objective functions $n$.
no code implementations • 1 Nov 2016 • Aryan Mokhtari, Mert Gürbüzbalaban, Alejandro Ribeiro
We prove not only that the proposed DIAG method converges linearly to the optimal solution, but also that its linear convergence factor justifies the advantage of incremental methods over GD.
no code implementations • 7 Oct 2016 • Tianyi Chen, Aryan Mokhtari, Xin Wang, Alejandro Ribeiro, Georgios B. Giannakis
Existing approaches to resource allocation in today's stochastic networks are challenged to meet fast convergence and tolerable delay requirements.
no code implementations • 15 Jun 2016 • Aryan Mokhtari, Alec Koppel, Alejandro Ribeiro
Algorithms that are parallel in either of these dimensions exist, but RAPSA is the first attempt at a methodology that is parallel in both the selection of blocks and the selection of elements of the training set.
no code implementations • NeurIPS 2016 • Aryan Mokhtari, Alejandro Ribeiro
We consider empirical risk minimization for large-scale datasets.
no code implementations • 23 Mar 2016 • Mark Eisen, Aryan Mokhtari, Alejandro Ribeiro
The resulting dual D-BFGS method is a fully decentralized algorithm in which nodes approximate curvature information of themselves and their neighbors through the satisfaction of a secant condition.
no code implementations • 16 Mar 2016 • Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, Alejandro Ribeiro
In this paper, we address tracking of a time-varying parameter with unknown dynamics.
no code implementations • 13 Jun 2015 • Aryan Mokhtari, Alejandro Ribeiro
The decentralized double stochastic averaging gradient (DSA) algorithm is proposed as an alternative solution that relies on the use of local stochastic averaging gradients.
no code implementations • 6 Sep 2014 • Aryan Mokhtari, Alejandro Ribeiro
Global convergence of an online (stochastic) limited-memory version of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method is established for solving optimization problems with stochastic objectives that arise in large-scale machine learning.
no code implementations • 20 Feb 2014 • Aryan Mokhtari, Alejandro Ribeiro
This paper adapts a recently developed regularized stochastic version of the Broyden, Fletcher, Goldfarb, and Shanno (BFGS) quasi-Newton method for the solution of support vector machine classification problems.
no code implementations • 29 Jan 2014 • Aryan Mokhtari, Alejandro Ribeiro
Numerical experiments showcase reductions in convergence time relative to stochastic gradient descent algorithms and non-regularized stochastic versions of BFGS.