no code implementations • 29 Oct 2024 • Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar
By studying in-context linear regression on unimodal Gaussian data, recent empirical and theoretical works have argued that in-context learning (ICL) emerges from Transformers' ability to simulate learning algorithms like gradient descent.
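For context, a minimal sketch of this in-context linear regression setup (dimensions, context length, and step size below are hypothetical), where the reference point is a single gradient-descent step on the in-context examples, the kind of learning algorithm the Transformer is argued to simulate:

```python
import numpy as np

# Minimal sketch (assumed setup): in-context linear regression on Gaussian data.
# A prompt contains n pairs (x_i, y_i) with y_i = <w*, x_i> plus a query x_q;
# the works cited above argue that a trained Transformer's prediction on x_q
# resembles the output of a learning algorithm such as gradient descent run on
# the in-context examples.
rng = np.random.default_rng(0)
d, n = 8, 32
w_star = rng.normal(size=d)       # task vector, resampled per prompt
X = rng.normal(size=(n, d))       # unimodal Gaussian covariates
y = X @ w_star                    # in-context labels
x_query = rng.normal(size=d)

# One gradient-descent step on the squared loss, starting from w = 0:
# w_1 = (eta / n) * X^T y, giving the prediction <w_1, x_query>.
eta = 0.1
w_gd = (eta / n) * X.T @ y
print(w_gd @ x_query, w_star @ x_query)
```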
no code implementations • 22 Oct 2024 • Khashayar Gatmiry, Jon Schneider, Stefanie Jegelka
Follow-the-Regularized-Leader (FTRL) algorithms are a popular class of learning algorithms for online linear optimization (OLO) that guarantee sub-linear regret, but the choice of regularizer can significantly impact dimension-dependent factors in the regret bound.
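As a point of reference, a minimal FTRL sketch for OLO over the unit ball with a quadratic regularizer, one common choice; the regularizer, step size, and feasible set here are illustrative assumptions, and the paper's point is precisely that this choice drives the dimension-dependent factors in the regret:

```python
import numpy as np

def ftrl_unit_ball(loss_vectors, eta=0.1):
    """Minimal FTRL sketch for online linear optimization over the unit ball
    with the (assumed) quadratic regularizer R(x) = ||x||^2 / (2 * eta).
    The round-t action is argmin_x <sum_{s<t} g_s, x> + R(x), which here is
    the projection of -eta * (running gradient sum) onto the ball."""
    d = len(loss_vectors[0])
    grad_sum = np.zeros(d)
    actions = []
    for g in loss_vectors:
        x = -eta * grad_sum
        norm = np.linalg.norm(x)
        if norm > 1.0:
            x = x / norm              # project onto the unit ball
        actions.append(x)
        grad_sum += g                 # the loss vector is revealed after playing x
    return actions

# Toy run: random linear losses in R^5 and the regret against the best fixed action.
rng = np.random.default_rng(1)
losses = [rng.normal(size=5) for _ in range(100)]
acts = ftrl_unit_ball(losses)
G = np.sum(losses, axis=0)
best_fixed = -G / np.linalg.norm(G)   # best action in hindsight on the ball
regret = sum(a @ g for a, g in zip(acts, losses)) - best_fixed @ G
print(regret)
```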
no code implementations • 21 Oct 2024 • Khashayar Gatmiry, Zhiyuan Li, Sashank J. Reddi, Stefanie Jegelka
To obtain this result, our main technical contribution is to show that label noise SGD always minimizes the sharpness on the manifold of models with zero loss for two-layer networks.
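For illustration, a minimal sketch of label-noise SGD, shown on a linear least-squares model rather than the two-layer networks analyzed in the paper; the learning rate and noise level are placeholders:

```python
import numpy as np

# Minimal sketch of label-noise SGD: at every step the sampled label is
# perturbed with fresh Gaussian noise before the gradient is computed.
# (Linear least squares is used for brevity; the quoted result is about
# two-layer networks, where this noise is argued to drive the iterates
# toward small trace of the Hessian on the zero-loss manifold.)
rng = np.random.default_rng(0)
n, d = 64, 16
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

w = np.zeros(d)
lr, sigma = 0.05, 0.1                         # placeholder step size and noise level
for _ in range(2000):
    i = rng.integers(n)                       # sample one training example
    y_noisy = y[i] + sigma * rng.normal()     # fresh label noise at every step
    grad = (X[i] @ w - y_noisy) * X[i]        # gradient of 0.5 * (x_i^T w - y_noisy)^2
    w -= lr * grad
```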
no code implementations • 10 Oct 2024 • Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar
To our knowledge, this is the first theoretical analysis of multi-layer Transformers in this setting.
no code implementations • 19 Sep 2024 • Muthu Chidambaram, Khashayar Gatmiry, Sitan Chen, Holden Lee, Jianfeng Lu
The use of guidance in diffusion models was originally motivated by the premise that the guidance-modified score is that of the data distribution tilted by a conditional likelihood raised to some power.
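Concretely, for a guidance weight $w$ this premise is usually written as (a standard formulation, stated here only for context)

$$\nabla_x \log p_t(x) + w\,\nabla_x \log p_t(c \mid x),$$

which, if it were an exact score, would correspond to sampling from the tilted density proportional to $p_t(x)\,p_t(c \mid x)^{w}$.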
no code implementations • 30 Jun 2024 • Khashayar Gatmiry, Jon Schneider
We study a variant of prediction with expert advice where the learner's action at round $t$ is only allowed to depend on losses on a specific subset of the rounds (where the structure of which rounds' losses are visible at time $t$ is provided by a directed "feedback graph" known to the learner).
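A minimal exponential-weights sketch of this constraint, where the action at round $t$ uses only the losses of the rounds visible at $t$ (the visibility function below is an arbitrary placeholder, not the paper's feedback-graph model):

```python
import numpy as np

def constrained_hedge(loss_matrix, visible, eta=0.5):
    """Minimal sketch: prediction with expert advice where the round-t action
    may only depend on the losses of the rounds in visible(t). The visibility
    function stands in for the feedback graph and is supplied by the caller.
    The update is Hedge-style exponential weighting on the visible losses."""
    T, K = loss_matrix.shape
    actions = []
    for t in range(T):
        seen = list(visible(t))                  # rounds whose losses are visible at time t
        cum = loss_matrix[seen].sum(axis=0) if seen else np.zeros(K)
        w = np.exp(-eta * cum)
        actions.append(w / w.sum())
    return actions

# Toy run: 4 experts, 50 rounds, and a simple "delayed by two rounds" visibility pattern.
rng = np.random.default_rng(2)
L = rng.uniform(size=(50, 4))
plays = constrained_hedge(L, visible=lambda t: range(max(0, t - 2)))
```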
no code implementations • 29 Apr 2024 • Khashayar Gatmiry, Jonathan Kelner, Holden Lee
We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($O(n^{\text{poly log}\left(\frac{n+k}{\varepsilon}\right)})$) time and sample complexity, under a minimum weight assumption.
no code implementations • 22 Aug 2023 • Amirhossein Reisizadeh, Khashayar Gatmiry, Asuman Ozdaglar
In many settings, however, heterogeneous data may be generated in clusters with shared structure, as is the case in applications such as federated learning, where a common latent variable governs the distribution of all the samples generated by a client.
no code implementations • 24 Jun 2023 • Haoyuan Sun, Khashayar Gatmiry, Kwangjun Ahn, Navid Azizan
However, the implicit regularization of different algorithms is confined to either a specific geometry or a particular class of learning problems, indicating the lack of a general approach for controlling implicit regularization.
no code implementations • 22 Jun 2023 • Khashayar Gatmiry, Zhiyuan Li, Ching-Yao Chuang, Sashank Reddi, Tengyu Ma, Stefanie Jegelka
Recent works on over-parameterized neural networks have shown that the stochasticity in optimizers has the implicit regularization effect of minimizing the sharpness of the loss function (in particular, the trace of its Hessian) over the family of zero-loss solutions.
no code implementations • 10 Apr 2023 • Yuansi Chen, Khashayar Gatmiry
We analyze the mixing time of Metropolized Hamiltonian Monte Carlo (HMC) with the leapfrog integrator to sample from a distribution on $\mathbb{R}^d$ whose log-density is smooth, has a Lipschitz Hessian in Frobenius norm, and satisfies isoperimetry.
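For reference, a minimal sketch of one Metropolized HMC step with the leapfrog integrator and identity mass matrix; the step size and trajectory length are placeholders, and the paper's contribution is the mixing-time analysis rather than the sampler itself:

```python
import numpy as np

def leapfrog(x, p, grad_log_density, step_size, n_steps):
    """Leapfrog trajectory with identity mass matrix: half momentum step,
    alternating full position/momentum steps, final half momentum step."""
    p = p + 0.5 * step_size * grad_log_density(x)
    for _ in range(n_steps - 1):
        x = x + step_size * p
        p = p + step_size * grad_log_density(x)
    x = x + step_size * p
    p = p + 0.5 * step_size * grad_log_density(x)
    return x, p

def metropolized_hmc_step(x, log_density, grad_log_density, step_size, n_steps, rng):
    """One Metropolized HMC step: resample the momentum, integrate with
    leapfrog, then accept or reject using the Hamiltonian difference."""
    p0 = rng.normal(size=x.shape)
    x_new, p_new = leapfrog(x, p0, grad_log_density, step_size, n_steps)
    h_old = -log_density(x) + 0.5 * p0 @ p0
    h_new = -log_density(x_new) + 0.5 * p_new @ p_new
    if rng.uniform() < np.exp(min(0.0, h_old - h_new)):
        return x_new
    return x
```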
no code implementations • 8 Apr 2023 • Yuansi Chen, Khashayar Gatmiry
We study the mixing time of the Metropolis-adjusted Langevin algorithm (MALA) for sampling from a target density on $\mathbb{R}^d$.
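For reference, a minimal sketch of a single MALA step (the step size $h$ is a placeholder; again, the paper's contribution is the mixing-time analysis, not the algorithm itself):

```python
import numpy as np

def mala_step(x, log_density, grad_log_density, h, rng):
    """One MALA step: a Langevin proposal y = x + h * grad log pi(x) + sqrt(2h) * xi,
    corrected by a Metropolis-Hastings accept/reject using the proposal densities."""
    y = x + h * grad_log_density(x) + np.sqrt(2 * h) * rng.normal(size=x.shape)

    def log_q(b, a):  # log density (up to a constant) of proposing b from a
        diff = b - a - h * grad_log_density(a)
        return -np.sum(diff ** 2) / (4 * h)

    log_ratio = (log_density(y) + log_q(x, y)) - (log_density(x) + log_q(y, x))
    if np.log(rng.uniform()) < log_ratio:
        return y
    return x
```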
no code implementations • 1 Mar 2023 • Khashayar Gatmiry, Jonathan Kelner, Santosh S. Vempala
We introduce a hybrid of the Lewis weights barrier and the standard logarithmic barrier and prove that the mixing rate for the corresponding Riemannian Hamiltonian Monte Carlo (RHMC) is bounded by $\tilde O(m^{1/3}n^{4/3})$, improving on the previous best bound of $\tilde O(mn^{2/3})$ (based on the log barrier).
no code implementations • 28 Dec 2022 • Tasuku Soma, Khashayar Gatmiry, Stefanie Jegelka
Distributionally robust optimization (DRO) can improve the robustness and fairness of learning methods.
no code implementations • 16 Nov 2022 • Khashayar Gatmiry, Thomas Kesselheim, Sahil Singla, Yifan Wang
The goal is to minimize the regret, which is the difference over $T$ rounds between the total value of the optimal algorithm that knows the distributions and the total value of our algorithm, which must learn the distributions from partial feedback.
no code implementations • 2 Nov 2022 • Zakaria Mhammedi, Khashayar Gatmiry
Typical algorithms for these settings, such as the Online Newton Step (ONS), can guarantee a $O(d\ln T)$ bound on their regret after $T$ rounds, where $d$ is the dimension of the feasible set.
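For context, a minimal sketch of the ONS update ($\gamma$ and the initial regularization are placeholder parameters, and the generalized projection onto the feasible set, taken in the $A_t$-norm in the full algorithm, is omitted here):

```python
import numpy as np

def online_newton_step(loss_gradients, d, gamma=1.0, eps=1.0):
    """Minimal ONS sketch for exp-concave losses: maintain A_t = eps*I + sum_s g_s g_s^T
    and move along A_t^{-1} g_t. The generalized projection back onto the feasible
    set (taken in the A_t-norm in the full algorithm) is omitted for brevity."""
    A = eps * np.eye(d)
    x = np.zeros(d)
    iterates = []
    for g in loss_gradients:
        iterates.append(x.copy())            # x is played before g is observed
        A += np.outer(g, g)
        x = x - (1.0 / gamma) * np.linalg.solve(A, g)
    return iterates
```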
no code implementations • 16 Aug 2022 • Nisha Chandramoorthy, Andreas Loukas, Khashayar Gatmiry, Stefanie Jegelka
To reduce this discrepancy between theory and practice, this paper focuses on the generalization of neural networks whose training dynamics do not necessarily converge to fixed points.
no code implementations • 22 Apr 2022 • Khashayar Gatmiry, Santosh S. Vempala
We study the Riemannian Langevin Algorithm for the problem of sampling from a distribution with density $\nu$ with respect to the natural measure on a manifold with metric $g$.
no code implementations • ICLR 2022 • Khashayar Gatmiry, Stefanie Jegelka, Jonathan Kelner
While there has been substantial recent work studying generalization of neural networks, the ability of deep nets to automate the process of feature extraction still evades a thorough mathematical understanding.
no code implementations • NeurIPS 2020 • Khashayar Gatmiry, Maryam Aliakbarpour, Stefanie Jegelka
Determinantal point processes (DPPs) are popular probabilistic models of diversity.
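As a reminder of the model, in an L-ensemble DPP a subset $S$ has probability proportional to $\det(L_S)$; a minimal sketch with a toy kernel:

```python
import numpy as np

# Minimal L-ensemble sketch: a subset S of a 5-item ground set has probability
# det(L_S) / det(L + I), so sets of similar (strongly correlated) items are
# unlikely -- the sense in which DPPs model diversity. The kernel below is a toy.
rng = np.random.default_rng(3)
B = rng.normal(size=(5, 3))
L = B @ B.T                                   # a PSD kernel on 5 items

def dpp_prob(S, L):
    S = list(S)
    num = np.linalg.det(L[np.ix_(S, S)]) if S else 1.0
    return num / np.linalg.det(L + np.eye(L.shape[0]))

print(dpp_prob({0, 2, 4}, L))
```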
no code implementations • 19 Nov 2018 • Khashayar Gatmiry, Manuel Gomez-Rodriguez
Then, we show that the same greedy algorithm offers a constant approximation factor of $(1 + 1/(1-\alpha))^{-1}$, where $\alpha$ is the generalized curvature of the function.
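For context, the standard greedy routine this kind of guarantee refers to, shown for a generic monotone set function under a cardinality constraint (the coverage objective in the example is a toy placeholder):

```python
def greedy(ground_set, f, k):
    """Standard greedy sketch: repeatedly add the element with the largest
    marginal gain until k elements are chosen. The approximation factor quoted
    above, (1 + 1/(1 - alpha))^{-1} with alpha the generalized curvature of f,
    is established in the paper, not by this sketch."""
    S = set()
    for _ in range(k):
        best = max((e for e in ground_set if e not in S),
                   key=lambda e: f(S | {e}) - f(S))
        S.add(best)
    return S

# Toy coverage-style objective on a 4-element ground set.
covers = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 4, 5}}
f = lambda S: len(set().union(*(covers[e] for e in S))) if S else 0
print(greedy(covers.keys(), f, 2))
```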