Search Results for author: Maksim Velikanov

Found 10 papers, 1 paper with code

Falcon Mamba: The First Competitive Attention-free 7B Language Model

no code implementations · 7 Oct 2024 · Jingwei Zuo, Maksim Velikanov, Dhia Eddine Rhaiem, Ilyas Chahed, Younes Belkada, Guillaume Kunsch, Hakim Hacid

It is on par with Gemma 7B and outperforms models with different architecture designs, such as RecurrentGemma 9B and RWKV-v6 Finch 7B/14B.

Tasks: Language Modeling · Language Modelling · +2
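A minimal, hedged usage sketch (not from the paper) for loading the released model with Hugging Face transformers; the checkpoint name tiiuae/falcon-mamba-7b is an assumption here and should be verified against the official release.

```python
# Hedged usage sketch: load and sample from Falcon Mamba via Hugging Face
# transformers. The checkpoint id "tiiuae/falcon-mamba-7b" is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Attention-free language models", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```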

SGD with memory: fundamental properties and stochastic acceleration

no code implementations · 5 Oct 2024 · Dmitry Yarotsky, Maksim Velikanov

In the non-stochastic setting, the optimal exponent $\xi$ in the loss convergence $L_t\sim C_Lt^{-\xi}$ is double that in plain GD and is achievable using Heavy Ball (HB) with a suitable schedule; this no longer works in the presence of mini-batch noise.
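A small numerical sketch of the power-law loss decay $L_t\sim C_Lt^{-\xi}$ under plain GD on a quadratic with an assumed power-law spectrum $\lambda_k = k^{-\nu}$; this illustrates only the baseline behaviour, not the paper's memory-augmented SGD or the Heavy Ball schedule that doubles the exponent.

```python
# Numerical sketch (assumptions: quadratic loss, eigenvalues k^{-nu}, flat initial
# error in the eigenbasis): plain GD on a power-law spectrum gives L_t ~ C_L * t^{-xi}.
import numpy as np

nu = 2.0                       # assumed spectral decay exponent
k = np.arange(1, 200_001)
lam = k ** (-nu)               # Hessian eigenvalues lambda_k = k^{-nu}
c = np.ones_like(lam)          # initial error coefficients in the eigenbasis
eta = 1.0 / lam.max()          # stable learning rate

steps = np.unique(np.logspace(1, 4, 30).astype(int))
loss = [0.5 * np.sum(lam * (c * (1.0 - eta * lam) ** t) ** 2) for t in steps]

# Fit the decay exponent xi from the tail of the trajectory; for this setup plain GD
# gives xi close to 1 - 1/nu = 0.5 (a suitable Heavy Ball schedule would double it).
xi = -np.polyfit(np.log(steps[10:]), np.log(loss[10:]), 1)[0]
print(f"fitted plain-GD exponent xi ~ {xi:.2f}")
```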

Generalization error of spectral algorithms

no code implementations · 18 Mar 2024 · Maksim Velikanov, Maxim Panov, Dmitry Yarotsky

In the present work, we consider the training of kernels with a family of $\textit{spectral algorithms}$ specified by a profile $h(\lambda)$ and including KRR and GD as special cases.
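For orientation, a toy sketch of spectral algorithms as filters applied to a kernel eigendecomposition, using standard textbook profiles for KRR and gradient flow; the paper's exact parameterization of $h(\lambda)$ may differ.

```python
# Toy sketch of spectral-algorithm filters on a kernel eigendecomposition.
# Assumption: standard textbook filter forms, which may differ from the
# profile h(lambda) used in the paper.
import numpy as np

def h_krr(lam, reg):
    # Kernel ridge regression filter: h(lambda) = 1 / (lambda + reg)
    return 1.0 / (lam + reg)

def h_gradient_flow(lam, t):
    # Gradient-flow (continuous-time GD) filter: h(lambda) = (1 - exp(-t*lambda)) / lambda
    return (1.0 - np.exp(-t * lam)) / lam

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1.0, 1.0, 50))
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1) + 1e-8 * np.eye(50)  # jittered RBF kernel

lam, V = np.linalg.eigh(K)
alpha_krr = V @ (h_krr(lam, reg=1e-2) * (V.T @ y))          # KRR as a spectral filter
alpha_gd = V @ (h_gradient_flow(lam, t=100.0) * (V.T @ y))  # GD (gradient flow) as a filter
print(np.linalg.norm(K @ alpha_krr - y), np.linalg.norm(K @ alpha_gd - y))
```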

Embedded Ensembles: Infinite Width Limit and Operating Regimes

no code implementations · 24 Feb 2022 · Maksim Velikanov, Roman Kail, Ivan Anokhin, Roman Vashurin, Maxim Panov, Alexey Zaytsev, Dmitry Yarotsky

In this limit, we identify two ensemble regimes, independent and collective, depending on the architecture and initialization strategy of ensemble models.

Tight Convergence Rate Bounds for Optimization Under Power Law Spectral Conditions

no code implementations · 2 Feb 2022 · Maksim Velikanov, Dmitry Yarotsky

In this paper, we propose a new spectral condition providing tighter upper bounds for problems with power law optimization trajectories.
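As a generic illustration of the setting (not necessarily the paper's exact condition), a power-law spectral condition on the Hessian eigenvalues and the corresponding power-law loss trajectory can be written as follows, in the same notation as the entry above:

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Illustrative power-law setting (hedged: not necessarily the paper's exact condition):
% the Hessian eigenvalues decay as a power law, and gradient descent then follows
% a power-law loss trajectory whose exponent is governed by the spectrum.
\begin{align*}
  \lambda_k &\asymp \Lambda\, k^{-\nu}, \qquad \nu > 1,
    && \text{(power-law spectral condition)} \\
  L_t &\sim C_L\, t^{-\xi}
    && \text{(power-law optimization trajectory)}
\end{align*}
\end{document}
```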

Explicit loss asymptotics in the gradient descent training of neural networks

no code implementations · NeurIPS 2021 · Maksim Velikanov, Dmitry Yarotsky

Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values.

Universal scaling laws in the gradient descent training of neural networks

no code implementations · 2 May 2021 · Maksim Velikanov, Dmitry Yarotsky

Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values.
