Search Results for author: Mikhail Belkin

Found 68 papers, 14 papers with code

Average gradient outer product as a mechanism for deep neural collapse

no code implementations 21 Feb 2024 Daniel Beaglehole, Peter Súkeník, Marco Mondelli, Mikhail Belkin

In this work, we provide substantial evidence that DNC formation occurs primarily through deep feature learning with the average gradient outer product (AGOP).
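The AGOP referenced in this abstract is commonly defined as the expected outer product of input gradients, $\mathrm{AGOP}(f) = \mathbb{E}_x[\nabla_x f(x)\, \nabla_x f(x)^\top]$. A minimal NumPy sketch for a toy two-layer network (the architecture, dimensions, and sampling here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 5, 16, 200          # input dim, hidden width, number of samples

W1 = rng.normal(size=(h, d))  # first-layer weights
w2 = rng.normal(size=h)       # second-layer weights
X = rng.normal(size=(n, d))   # sample inputs

def input_gradient(x):
    """Gradient of f(x) = w2 . tanh(W1 x) with respect to the input x."""
    pre = W1 @ x
    return W1.T @ (w2 * (1.0 - np.tanh(pre) ** 2))

# Average gradient outer product over the sample.
agop = np.mean([np.outer(g, g) for g in (input_gradient(x) for x in X)], axis=0)

print(agop.shape)  # (5, 5)
```

Because the AGOP is an average of outer products, it is symmetric and positive semi-definite; its top eigenvectors indicate the input directions to which the model is most sensitive.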

Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination

1 code implementation 15 Feb 2024 Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić

Our results demonstrate the usefulness of this approach across different models and sizes, and also with parameter-efficient fine-tuning, offering a novel pathway to addressing the challenges with private and sensitive data in LLM applications.

Natural Language Understanding

Linear Recursive Feature Machines provably recover low-rank matrices

1 code implementation 9 Jan 2024 Adityanarayanan Radhakrishnan, Mikhail Belkin, Dmitriy Drusvyatskiy

A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction - a process called feature learning.

Dimensionality Reduction Low-Rank Matrix Completion +1

On the Nystrom Approximation for Preconditioning in Kernel Machines

no code implementations 6 Dec 2023 Amirhesam Abedsoltan, Parthe Pandit, Luis Rademacher, Mikhail Belkin

Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning.

Mechanism of feature learning in convolutional neural networks

1 code implementation 1 Sep 2023 Daniel Beaglehole, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin

We then demonstrate the generality of our result by using the patch-based AGOP to enable deep feature learning in convolutional kernel machines.

Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

no code implementations 7 Jun 2023 Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD).

On Emergence of Clean-Priority Learning in Early Stopped Neural Networks

no code implementations 5 Jun 2023 Chaoyue Liu, Amirhesam Abedsoltan, Mikhail Belkin

This behaviour is believed to be a result of neural networks learning the pattern of clean data first and fitting the noise later in the training, a phenomenon that we refer to as clean-priority learning.

Cut your Losses with Squentropy

no code implementations 8 Feb 2023 Like Hui, Mikhail Belkin, Stephen Wright

We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy.

Classification Multi-class Classification

Toward Large Kernel Models

1 code implementation 6 Feb 2023 Amirhesam Abedsoltan, Mikhail Belkin, Parthe Pandit

Recent studies indicate that kernel machines can often perform similarly or better than deep neural networks (DNNs) on small datasets.

Restricted Strong Convexity of Deep Learning Models with Smooth Activations

no code implementations 29 Sep 2022 Arindam Banerjee, Pedro Cisneros-Velarde, Libin Zhu, Mikhail Belkin

Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC) which holds as long as the squared norm of the average gradient of predictors is $\Omega(\frac{\text{poly}(L)}{\sqrt{m}})$ for the square loss.
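For context, restricted strong convexity requires the usual strong-convexity lower bound only over a restricted set $S$ of parameters. A standard textbook form of the condition (paraphrased here for reference, not quoted from the paper) is:

```latex
L(\theta') \;\ge\; L(\theta) + \langle \nabla L(\theta),\, \theta' - \theta \rangle
          + \frac{\alpha}{2}\,\lVert \theta' - \theta \rVert_2^2
\qquad \text{for all } \theta, \theta' \in S,\ \alpha > 0.
```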

A Universal Trade-off Between the Model Size, Test Loss, and Training Loss of Linear Predictors

no code implementations 23 Jul 2022 Nikhil Ghosh, Mikhail Belkin

Remarkably, while the Marchenko-Pastur analysis is far more precise near the interpolation peak, where the number of parameters is just enough to fit the training data, it coincides exactly with the distribution independent bound as the level of overparametrization increases.

Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting

no code implementations 14 Jul 2022 Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran

In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime.

Learning Theory

A note on Linear Bottleneck networks and their Transition to Multilinearity

no code implementations 30 Jun 2022 Libin Zhu, Parthe Pandit, Mikhail Belkin

In this work we show that linear networks with a bottleneck layer learn bilinear functions of the weights, in a ball of radius $O(1)$ around initialization.

On the Inconsistency of Kernel Ridgeless Regression in Fixed Dimensions

no code implementations 26 May 2022 Daniel Beaglehole, Mikhail Belkin, Parthe Pandit

"Benign overfitting", the ability of certain algorithms to interpolate noisy training data and yet perform well out-of-sample, has been a topic of considerable recent interest.

Regression

Quadratic models for understanding catapult dynamics of neural networks

1 code implementation 24 May 2022 Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models.

Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture

no code implementations 24 May 2022 Libin Zhu, Chaoyue Liu, Mikhail Belkin

In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity.

Wide and Deep Neural Networks Achieve Optimality for Classification

no code implementations 29 Apr 2022 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

In this work, we identify and construct an explicit set of neural network classifiers that achieve optimality.


Transition to Linearity of Wide Neural Networks is an Emerging Property of Assembling Weak Models

no code implementations ICLR 2022 Chaoyue Liu, Libin Zhu, Mikhail Belkin

Wide neural networks with linear output layer have been shown to be near-linear, and to have near-constant neural tangent kernel (NTK), in a region containing the optimization path of gradient descent.

Limitations of Neural Collapse for Understanding Generalization in Deep Learning

no code implementations 17 Feb 2022 Like Hui, Mikhail Belkin, Preetum Nakkiran

We refine the Neural Collapse conjecture into two separate conjectures: collapse on the train set (an optimization property) and collapse on the test distribution (a generalization property).

Representation Learning

Benign Overfitting in Two-layer Convolutional Neural Networks

no code implementations 14 Feb 2022 Yuan Cao, Zixiang Chen, Mikhail Belkin, Quanquan Gu

In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN).


Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive Step Size

no code implementations 30 Dec 2021 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Establishing a fast rate of convergence for optimization methods is crucial to their applicability in practice.

Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

1 code implementation 29 May 2021 Mikhail Belkin

In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges.

BIG-bench Machine Learning

On the linearity of large non-linear models: when and why the tangent kernel is constant

no code implementations NeurIPS 2020 Chaoyue Liu, Libin Zhu, Mikhail Belkin

We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width.

Linear Convergence and Implicit Regularization of Generalized Mirror Descent with Time-Dependent Mirrors

no code implementations 28 Sep 2020 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

The following questions are fundamental to understanding the properties of over-parameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum?

Linear Convergence of Generalized Mirror Descent with Time-Dependent Mirrors

no code implementations 18 Sep 2020 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

GMD subsumes popular first order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad.
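A generalized mirror descent step can be written as $\nabla\phi(w_{t+1}) = \nabla\phi(w_t) - \eta\, \nabla L(w_t)$ for a mirror map $\phi$; choosing $\phi(w) = \tfrac{1}{2}\lVert w\rVert^2$ recovers plain gradient descent, and $\phi(w) = \tfrac{1}{2} w^\top P w$ recovers preconditioned gradient descent. A small sketch verifying both reductions on a hypothetical quadratic objective (the objective and mirror maps are illustrative, not from the paper):

```python
import numpy as np

# Objective: L(w) = 0.5 * ||A w - b||^2, with gradient A^T (A w - b).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
grad = lambda w: A.T @ (A @ w - b)

def gmd_step(w, eta, mirror_grad, mirror_grad_inv):
    """One generalized mirror descent step: grad_phi(w_next) = grad_phi(w) - eta * grad L(w)."""
    return mirror_grad_inv(mirror_grad(w) - eta * grad(w))

w0 = np.array([0.5, 0.5])
eta = 0.1

# Mirror map phi(w) = 0.5 ||w||^2  =>  grad_phi = identity  =>  plain gradient descent.
identity = lambda w: w
w_gd = w0 - eta * grad(w0)
w_md = gmd_step(w0, eta, identity, identity)
print(np.allclose(w_gd, w_md))  # the two updates coincide

# Mirror map phi(w) = 0.5 w^T P w  =>  preconditioned step w - eta * P^{-1} grad L(w).
P = np.diag([4.0, 1.0])
w_pre = gmd_step(w0, eta, lambda w: P @ w, lambda z: np.linalg.solve(P, z))
```

Adagrad-style methods fit the same template with a time-dependent diagonal preconditioner, which is what "time-dependent mirrors" refers to in the title.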

Multiple Descent: Design Your Own Generalization Curve

no code implementations NeurIPS 2021 Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi

This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized.


Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks

no code implementations ICLR 2021 Like Hui, Mikhail Belkin

We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources.

Automatic Speech Recognition (ASR) +2

Loss landscapes and optimization in over-parameterized non-linear systems and neural networks

no code implementations 29 Feb 2020 Chaoyue Liu, Libin Zhu, Mikhail Belkin

The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks.

Overparameterized Neural Networks Implement Associative Memory

1 code implementation 26 Sep 2019 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Identifying computational mechanisms for memorization and retrieval of data is a long-standing problem at the intersection of machine learning and neuroscience.

Memorization Retrieval

Overparameterized Neural Networks Can Implement Associative Memory

no code implementations 25 Sep 2019 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Identifying computational mechanisms for memorization and retrieval is a long-standing problem at the intersection of machine learning and neuroscience.

Memorization Retrieval

Downsampling leads to Image Memorization in Convolutional Autoencoders

no code implementations ICLR 2019 Adityanarayanan Radhakrishnan, Caroline Uhler, Mikhail Belkin

In this paper, we link memorization of images in deep convolutional autoencoders to downsampling through strided convolution.


Two models of double descent for weak features

no code implementations 18 Mar 2019 Mikhail Belkin, Daniel Hsu, Ji Xu

The "double descent" risk curve was proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models.

BIG-bench Machine Learning

Reconciling modern machine learning practice and the bias-variance trade-off

2 code implementations 28 Dec 2018 Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal

This connection between the performance and the structure of machine learning models delineates the limits of classical analyses, and has implications for both the theory and practice of machine learning.

BIG-bench Machine Learning

On exponential convergence of SGD in non-convex over-parametrized learning

no code implementations 6 Nov 2018 Raef Bassily, Mikhail Belkin, Siyuan Ma

Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning.

BIG-bench Machine Learning

Accelerating SGD with momentum for over-parameterized learning

1 code implementation ICLR 2020 Chaoyue Liu, Mikhail Belkin

This is in contrast to the classical results in the deterministic scenario, where the same step size ensures accelerated convergence of Nesterov's method over optimal gradient descent.

Memorization in Overparameterized Autoencoders

no code implementations ICML Workshop Deep_Phenomen 2019 Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, Caroline Uhler

The ability of deep neural networks to generalize well in the overparameterized regime has become a subject of significant research interest.

Inductive Bias Memorization

Does data interpolation contradict statistical optimality?

no code implementations 25 Jun 2018 Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov

We show that learning methods interpolating the training data can achieve optimal rates for the problems of nonparametric regression and prediction with square loss.


Kernel machines that adapt to GPUs for effective large batch training

2 code implementations 15 Jun 2018 Siyuan Ma, Mikhail Belkin

In this paper we develop the first analytical framework that extends linear scaling to match the parallel computing capacity of a resource.

Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

no code implementations NeurIPS 2018 Mikhail Belkin, Daniel Hsu, Partha Mitra

Finally, this paper suggests a way to explain the phenomenon of adversarial examples, which are seemingly ubiquitous in modern machine learning, and also discusses some connections to kernel machines and random forests in the interpolated regime.

BIG-bench Machine Learning General Classification +1

Parametrized Accelerated Methods Free of Condition Number

no code implementations 28 Feb 2018 Chaoyue Liu, Mikhail Belkin

Analyses of accelerated (momentum-based) gradient descent usually assume bounded condition number to obtain exponential convergence rates.

Fast Interactive Image Retrieval using large-scale unlabeled data

no code implementations 12 Feb 2018 Akshay Mehra, Jihun Hamm, Mikhail Belkin

Active learning reduces the number of user interactions by querying the labels of the most informative points, and GSSL makes it possible to use abundant unlabeled data along with the limited labeled data provided by the user.

Active Learning Binary Classification +2

To understand deep learning we need to understand kernel learning

no code implementations ICML 2018 Mikhail Belkin, Siyuan Ma, Soumik Mandal

Certain key phenomena of deep learning are manifested similarly in kernel methods in the modern "overfitted" regime.

Generalization Bounds

Approximation beats concentration? An approximation view on inference with smooth radial kernels

no code implementations 10 Jan 2018 Mikhail Belkin

We analyze eigenvalue decay of kernel operators and matrices, properties of eigenfunctions/eigenvectors and "Fourier" coefficients of functions in the kernel space restricted to a discrete set of data points.
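The fast eigenvalue decay of smooth radial kernels is easy to observe numerically: the spectrum of a Gaussian kernel matrix on random data drops by orders of magnitude within a few dozen eigenvalues. A generic illustration (the data, dimension, and bandwidth are arbitrary choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))        # 200 points in 2D

# Gaussian (RBF) kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 s^2)), s = 0.5.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * 0.5 ** 2))

eigvals = np.linalg.eigvalsh(K)[::-1]        # sorted in descending order
print(eigvals[0] / eigvals[49])              # ratio spans orders of magnitude
```

This rapid decay is what limits the portion of the function space reachable by a polynomial number of gradient descent iterations in the abstract above.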

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

no code implementations ICML 2018 Siyuan Ma, Raef Bassily, Mikhail Belkin

We show that there is a critical batch size $m^*$ such that: (a) SGD iteration with mini-batch size $m \leq m^*$ is nearly equivalent to $m$ iterations of mini-batch size $1$ (the linear scaling regime).
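The linear scaling regime can be illustrated by running minibatch SGD on an overparameterized least-squares problem with the step size scaled linearly in the batch size; both batch sizes drive the training loss to (near) zero. This is only a toy sketch under assumed dimensions and step sizes, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                 # fewer samples than parameters: interpolation is possible
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star                # noiseless targets, so zero training loss is attainable

def sgd(batch_size, base_lr=0.01, epochs=200):
    """Minibatch SGD on 0.5*||Xw - y||^2/n with lr scaled linearly in batch size."""
    w = np.zeros(d)
    lr = base_lr * batch_size
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            B = idx[start:start + batch_size]
            g = X[B].T @ (X[B] @ w - y[B]) / len(B)
            w -= lr * g
    return 0.5 * np.mean((X @ w - y) ** 2)

loss_b1 = sgd(batch_size=1)
loss_b4 = sgd(batch_size=4)
print(loss_b1, loss_b4)       # both training losses end up near zero
```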

Unperturbed: spectral analysis beyond Davis-Kahan

no code implementations 20 Jun 2017 Justin Eldridge, Mikhail Belkin, Yusu Wang

Classical matrix perturbation results, such as Weyl's theorem for eigenvalues and the Davis-Kahan theorem for eigenvectors, are general purpose.


Diving into the shallows: a computational perspective on large-scale shallow learning

1 code implementation NeurIPS 2017 Siyuan Ma, Mikhail Belkin

An analysis based on the spectral properties of the kernel demonstrates that only a vanishingly small portion of the function space is reachable after a polynomial number of gradient descent iterations.

Learning Privately from Multiparty Data

no code implementations 10 Feb 2016 Jihun Hamm, Paul Cao, Mikhail Belkin

How can we build an accurate and differentially private global classifier by combining locally-trained classifiers from different parties, without access to any party's private data?

Activity Recognition Network Intrusion Detection

Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering

no code implementations 21 Jun 2015 Justin Eldridge, Mikhail Belkin, Yusu Wang

In this paper we identify two limit properties, separation and minimality, which address both over-segmentation and improper nesting and together imply (but are not implied by) Hartigan consistency.


Probabilistic Zero-shot Classification with Semantic Rankings

no code implementations 27 Feb 2015 Jihun Hamm, Mikhail Belkin

In this paper we propose a non-metric ranking-based representation of semantic similarity that allows natural aggregation of semantic information from multiple heterogeneous sources.

Classification General Classification +3

A Pseudo-Euclidean Iteration for Optimal Recovery in Noisy ICA

no code implementations NeurIPS 2015 James Voss, Mikhail Belkin, Luis Rademacher

We propose a new algorithm, PEGI (for pseudo-Euclidean Gradient Iteration), for provable model recovery for ICA with Gaussian noise.

Crowd-ML: A Privacy-Preserving Learning Framework for a Crowd of Smart Devices

no code implementations 11 Jan 2015 Jihun Hamm, Adam Champion, Guoxing Chen, Mikhail Belkin, Dong Xuan

Smart devices with built-in sensors, computational capabilities, and network connectivity have become increasingly pervasive.

Privacy Preserving

Learning with Fredholm Kernels

no code implementations NeurIPS 2014 Qichao Que, Mikhail Belkin, Yusu Wang

In this paper we propose a framework for supervised and semi-supervised learning based on reformulating the learning problem as a regularized Fredholm integral equation.

Eigenvectors of Orthogonally Decomposable Functions

no code implementations 5 Nov 2014 Mikhail Belkin, Luis Rademacher, James Voss

It includes influential Machine Learning methods such as cumulant-based FastICA and the tensor power iteration for orthogonally decomposable tensors as special cases.

Clustering Topic Models

The Hidden Convexity of Spectral Clustering

1 code implementation 4 Mar 2014 James Voss, Mikhail Belkin, Luis Rademacher

Geometrically, the proposed algorithms can be interpreted as hidden basis recovery by means of function optimization.


Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis

no code implementations NeurIPS 2013 James R. Voss, Luis Rademacher, Mikhail Belkin

In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise.

The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

no code implementations 12 Nov 2013 Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, James Voss

The problem of learning this map can be efficiently solved using some recent results on tensor decompositions and Independent Component Analysis (ICA), thus giving an algorithm for recovering the mixture.

Inverse Density as an Inverse Problem: The Fredholm Equation Approach

no code implementations NeurIPS 2013 Qichao Que, Mikhail Belkin

In this paper we address the problem of estimating the ratio $\frac{q}{p}$ where $p$ is a density function and $q$ is another density, or, more generally an arbitrary function.
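A naive baseline for the ratio $\frac{q}{p}$ is to estimate both densities separately and divide; the instability of that plug-in estimate, especially in low-density regions, is part of what motivates solving for the ratio directly. A quick NumPy sketch of the baseline (the distributions, bandwidth, and evaluation point are illustrative assumptions, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)
xs_p = rng.normal(0.0, 1.0, size=2000)   # samples from p = N(0, 1)
xs_q = rng.normal(0.5, 1.0, size=2000)   # samples from q = N(0.5, 1)

def kde(samples, x, bandwidth=0.3):
    """Gaussian kernel density estimate of the sample's density at point x."""
    z = (x - samples) / bandwidth
    return np.mean(np.exp(-0.5 * z ** 2)) / (bandwidth * np.sqrt(2 * np.pi))

# Plug-in estimate of q/p at a point: divide two noisy density estimates.
x0 = 0.0
ratio = kde(xs_q, x0) / kde(xs_p, x0)
print(ratio)   # true ratio q(0)/p(0) = exp(-0.125), about 0.88
```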

Transfer Learning

Blind Signal Separation in the Presence of Gaussian Noise

no code implementations 7 Nov 2012 Mikhail Belkin, Luis Rademacher, James Voss

In this paper we propose a new algorithm for solving the blind signal separation problem in the presence of additive Gaussian noise, when we are given samples from $X = AS + \eta$, where $\eta$ is drawn from an unknown, not necessarily spherical $n$-dimensional Gaussian distribution.

Data Skeletonization via Reeb Graphs

no code implementations NeurIPS 2011 Xiaoyin Ge, Issam I. Safa, Mikhail Belkin, Yusu Wang

While such data is often high-dimensional, it is of interest to approximate it with a low-dimensional or even one-dimensional space, since many important aspects of data are often intrinsically low-dimensional.

Semi-supervised Learning using Sparse Eigenfunction Bases

no code implementations NeurIPS 2009 Kaushik Sinha, Mikhail Belkin

We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices.
