no code implementations • 21 Feb 2024 • Daniel Beaglehole, Peter Súkeník, Marco Mondelli, Mikhail Belkin

In this work, we provide substantial evidence that Deep Neural Collapse (DNC) formation occurs primarily through deep feature learning with the average gradient outer product (AGOP).
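The AGOP referenced here is commonly defined as the average, over inputs, of the outer product of the predictor's input-gradients. A minimal NumPy sketch of that quantity follows; the toy predictor and its analytic gradient are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def agop(grad_fn, X):
    """Average gradient outer product: (1/n) * sum_i g(x_i) g(x_i)^T,
    where g(x) is the gradient of the predictor w.r.t. its input."""
    G = np.stack([grad_fn(x) for x in X])   # shape (n, d)
    return G.T @ G / len(X)                 # shape (d, d)

# Toy predictor f(x) = (w . x)^2 with analytic input-gradient 2 (w . x) w.
w = np.array([1.0, -2.0, 0.5])
grad_f = lambda x: 2.0 * (w @ x) * w

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
M = agop(grad_f, X)
# M is symmetric positive semi-definite; for this f it is rank one,
# and its top eigenvector aligns with w, the single relevant direction.
```

In feature-learning terms, the AGOP's top eigenspaces pick out the input directions the predictor actually uses.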

1 code implementation • 15 Feb 2024 • Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić

Our results demonstrate the usefulness of this approach across different models and sizes, and also with parameter-efficient fine-tuning, offering a novel pathway to addressing the challenges with private and sensitive data in LLM applications.

1 code implementation • 9 Jan 2024 • Adityanarayanan Radhakrishnan, Mikhail Belkin, Dmitriy Drusvyatskiy

A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction, a process called feature learning.

no code implementations • 6 Dec 2023 • Amirhesam Abedsoltan, Parthe Pandit, Luis Rademacher, Mikhail Belkin

Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning.

no code implementations • 24 Nov 2023 • James B. Simon, Dhruva Karkada, Nikhil Ghosh, Mikhail Belkin

In our era of enormous neural networks, empirical progress has been driven by the philosophy that more is better.

1 code implementation • 1 Sep 2023 • Daniel Beaglehole, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin

We then demonstrate the generality of our result by using the patch-based AGOP to enable deep feature learning in convolutional kernel machines.

no code implementations • 7 Jun 2023 • Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD).

no code implementations • 5 Jun 2023 • Chaoyue Liu, Amirhesam Abedsoltan, Mikhail Belkin

This behaviour is believed to be a result of neural networks learning the pattern of clean data first and fitting the noise later in the training, a phenomenon that we refer to as clean-priority learning.

no code implementations • 8 Feb 2023 • Like Hui, Mikhail Belkin, Stephen Wright

We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy.
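Squentropy, as described in this line of work, combines the cross-entropy on the true class with an average squared penalty on the logits of the incorrect classes. The sketch below is a hedged NumPy rendering of that idea, not the authors' reference implementation:

```python
import numpy as np

def squentropy(logits, y):
    """Hedged sketch of the squentropy loss: cross-entropy on the true
    class plus the mean squared logit over the incorrect classes.
    logits: (n, C) array; y: (n,) integer labels."""
    n, C = logits.shape
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_p[np.arange(n), y].mean()
    # Mask out the true-class logit; square-penalize the rest.
    mask = np.ones_like(logits, dtype=bool)
    mask[np.arange(n), y] = False
    sq = (logits[mask] ** 2).reshape(n, C - 1).mean()
    return ce + sq

logits = np.array([[2.0, -1.0, 0.5], [0.1, 3.0, -0.2]])
y = np.array([0, 1])
loss = squentropy(logits, y)
```

Since the squared-logit term is non-negative, the loss upper-bounds plain cross-entropy; it requires no extra hyper-parameters.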

1 code implementation • 6 Feb 2023 • Amirhesam Abedsoltan, Mikhail Belkin, Parthe Pandit

Recent studies indicate that kernel machines can often perform similarly or better than deep neural networks (DNNs) on small datasets.

3 code implementations • 28 Dec 2022 • Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, Mikhail Belkin

In recent years neural networks have achieved impressive results on many technological and scientific tasks.

no code implementations • 29 Sep 2022 • Arindam Banerjee, Pedro Cisneros-Velarde, Libin Zhu, Mikhail Belkin

Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC) which holds as long as the squared norm of the average gradient of predictors is $\Omega(\frac{\text{poly}(L)}{\sqrt{m}})$ for the square loss.
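For context, a standard statement of a restricted strong convexity condition is sketched below; the paper's precise restricted set, constants, and norm may differ:

```latex
% Restricted Strong Convexity (RSC), standard form: the strong-convexity
% lower bound is required only over a restricted set S of parameter pairs.
\mathcal{L}(w') \ge \mathcal{L}(w)
  + \langle \nabla \mathcal{L}(w),\, w' - w \rangle
  + \frac{\alpha}{2}\, \lVert w' - w \rVert_2^2,
  \qquad (w, w') \in S,\ \alpha > 0 .
```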

no code implementations • 23 Jul 2022 • Nikhil Ghosh, Mikhail Belkin

Remarkably, while the Marchenko-Pastur analysis is far more precise near the interpolation peak, where the number of parameters is just enough to fit the training data, it coincides exactly with the distribution independent bound as the level of overparametrization increases.

no code implementations • 14 Jul 2022 • Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran

In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime.

no code implementations • 30 Jun 2022 • Libin Zhu, Parthe Pandit, Mikhail Belkin

In this work we show that linear networks with a bottleneck layer learn bilinear functions of the weights, in a ball of radius $O(1)$ around initialization.

no code implementations • 26 May 2022 • Daniel Beaglehole, Mikhail Belkin, Parthe Pandit

"Benign overfitting", the ability of certain algorithms to interpolate noisy training data and yet perform well out-of-sample, has been a topic of considerable recent interest.

1 code implementation • 24 May 2022 • Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models.

no code implementations • 24 May 2022 • Libin Zhu, Chaoyue Liu, Mikhail Belkin

In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity.

no code implementations • 29 Apr 2022 • Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

In this work, we identify and construct an explicit set of neural network classifiers that achieve optimality.

no code implementations • ICLR 2022 • Chaoyue Liu, Libin Zhu, Mikhail Belkin

Wide neural networks with linear output layer have been shown to be near-linear, and to have near-constant neural tangent kernel (NTK), in a region containing the optimization path of gradient descent.

no code implementations • 17 Feb 2022 • Like Hui, Mikhail Belkin, Preetum Nakkiran

We refine the Neural Collapse conjecture into two separate conjectures: collapse on the train set (an optimization property) and collapse on the test distribution (a generalization property).

no code implementations • 14 Feb 2022 • Yuan Cao, Zixiang Chen, Mikhail Belkin, Quanquan Gu

In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN).

no code implementations • 30 Dec 2021 • Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Establishing a fast rate of convergence for optimization methods is crucial to their applicability in practice.

1 code implementation • 31 Jul 2021 • Adityanarayanan Radhakrishnan, George Stefanakis, Mikhail Belkin, Caroline Uhler

Remarkably, taking the width of a neural network to infinity allows for improved computational performance.

1 code implementation • 29 May 2021 • Mikhail Belkin

In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges.

no code implementations • NeurIPS 2021 • Yuan Cao, Quanquan Gu, Mikhail Belkin

In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems.

no code implementations • NeurIPS 2020 • Chaoyue Liu, Libin Zhu, Mikhail Belkin

We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width.

no code implementations • 28 Sep 2020 • Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

The following questions are fundamental to understanding the properties of over-parameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum?

no code implementations • 18 Sep 2020 • Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

GMD subsumes popular first order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad.
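The subsumption claim can be illustrated with the usual mirror-descent update, which acts in the dual space defined by a potential $\phi$: choosing the identity potential recovers gradient descent, and a quadratic potential recovers preconditioned gradient descent. A sketch, with an illustrative quadratic potential and toy loss (assumptions, not the paper's setup):

```python
import numpy as np

def mirror_step(w, grad, eta, grad_phi, grad_phi_inv):
    # Update in the dual space: grad_phi(w+) = grad_phi(w) - eta * grad.
    return grad_phi_inv(grad_phi(w) - eta * grad)

# Quadratic potential phi(w) = 0.5 w^T P w  =>  grad_phi(w) = P w,
# which yields preconditioned gradient descent: w+ = w - eta * P^{-1} grad.
P = np.diag([1.0, 4.0])
P_inv = np.linalg.inv(P)
step_precond = lambda w, g, eta: mirror_step(
    w, g, eta, lambda v: P @ v, lambda v: P_inv @ v)

# On the toy loss L(w) = 0.5 ||w||^2 (so grad = w):
w = np.array([1.0, 1.0])
w_next = step_precond(w, w, 0.1)   # = w - 0.1 * P^{-1} w
```

With `grad_phi = grad_phi_inv = identity` the same `mirror_step` reduces to plain gradient descent, which is the sense in which these methods are special cases.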

no code implementations • NeurIPS 2021 • Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi

This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized.

no code implementations • ICLR 2021 • Like Hui, Mikhail Belkin

We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources.


no code implementations • 16 May 2020 • Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, Anant Sahai

We compare classification and regression tasks in an overparameterized linear model with Gaussian features.

no code implementations • 29 Feb 2020 • Chaoyue Liu, Libin Zhu, Mikhail Belkin

The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks.

1 code implementation • 26 Sep 2019 • Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Identifying computational mechanisms for memorization and retrieval of data is a long-standing problem at the intersection of machine learning and neuroscience.

no code implementations • 25 Sep 2019 • Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Identifying computational mechanisms for memorization and retrieval is a long-standing problem at the intersection of machine learning and neuroscience.

no code implementations • ICLR 2019 • Adityanarayanan Radhakrishnan, Caroline Uhler, Mikhail Belkin

In this paper, we link memorization of images in deep convolutional autoencoders to downsampling through strided convolution.

no code implementations • 18 Mar 2019 • Mikhail Belkin, Daniel Hsu, Ji Xu

The "double descent" risk curve was proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models.
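A toy illustration of the double descent shape can be obtained by fitting minimum-norm least squares (via the pseudoinverse) while sweeping the feature count through the interpolation threshold. The data model below is illustrative only, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 200, 60
X = rng.standard_normal((n_train, d))
beta = rng.standard_normal(d) / np.sqrt(d)
y = X @ beta + 0.5 * rng.standard_normal(n_train)
X_te = rng.standard_normal((n_test, d))
y_te = X_te @ beta

test_mse = {}
for p in (10, 40, 60):   # under-, critically, and over-parameterized
    w = np.linalg.pinv(X[:, :p]) @ y   # minimum-norm least-squares fit
    test_mse[p] = np.mean((X_te[:, :p] @ w - y_te) ** 2)
# Test error typically peaks near p == n_train (the interpolation
# threshold) and falls again as p grows past it.
```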

2 code implementations • 28 Dec 2018 • Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal

This connection between the performance and the structure of machine learning models delineates the limits of classical analyses, and has implications for both the theory and practice of machine learning.

no code implementations • 6 Nov 2018 • Raef Bassily, Mikhail Belkin, Siyuan Ma

Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning.

no code implementations • 6 Nov 2018 • Like Hui, Siyuan Ma, Mikhail Belkin

We apply a fast kernel method for mask-based single-channel speech enhancement.

1 code implementation • ICLR 2020 • Chaoyue Liu, Mikhail Belkin

This is in contrast to the classical results in the deterministic scenario, where the same step size ensures accelerated convergence of the Nesterov's method over optimal gradient descent.

no code implementations • ICML 2019 Deep Phenomena Workshop • Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, Caroline Uhler

The ability of deep neural networks to generalize well in the overparameterized regime has become a subject of significant research interest.

no code implementations • 25 Jun 2018 • Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov

We show that learning methods interpolating the training data can achieve optimal rates for the problems of nonparametric regression and prediction with square loss.
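One classical interpolating scheme of the kind studied here is Nadaraya-Watson regression with a singular kernel, whose weight diverges at each training point so the estimator fits the training data exactly while remaining a local average elsewhere. A 1-D sketch with illustrative parameters:

```python
import numpy as np

def singular_nw(x, X, y, a=1.0, eps=1e-12):
    """Nadaraya-Watson estimate with singular weight ~ 1/|x - x_i|^a."""
    d = np.abs(x - X)
    if d.min() < eps:            # query coincides with a training point
        return y[d.argmin()]     # -> exact interpolation
    w = d ** (-a)
    return (w * y).sum() / w.sum()

X = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])
# Interpolates the data: singular_nw(1.0, X, y) == 1.0, while
# off-sample queries return a weighted local average.
```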

2 code implementations • 15 Jun 2018 • Siyuan Ma, Mikhail Belkin

In this paper we develop the first analytical framework that extends linear scaling to match the parallel computing capacity of a resource.

no code implementations • NeurIPS 2018 • Mikhail Belkin, Daniel Hsu, Partha Mitra

Finally, this paper suggests a way to explain the phenomenon of adversarial examples, which are seemingly ubiquitous in modern machine learning, and also discusses some connections to kernel machines and random forests in the interpolated regime.

no code implementations • 28 Feb 2018 • Chaoyue Liu, Mikhail Belkin

Analyses of accelerated (momentum-based) gradient descent usually assume bounded condition number to obtain exponential convergence rates.

no code implementations • 12 Feb 2018 • Akshay Mehra, Jihun Hamm, Mikhail Belkin

Active learning reduces the number of user interactions by querying the labels of the most informative points, and GSSL allows the use of abundant unlabeled data along with the limited labeled data provided by the user.

no code implementations • ICML 2018 • Mikhail Belkin, Siyuan Ma, Soumik Mandal

Certain key phenomena of deep learning are manifested similarly in kernel methods in the modern "overfitted" regime.

no code implementations • 10 Jan 2018 • Mikhail Belkin

We analyze eigenvalue decay of kernel operators and matrices, properties of eigenfunctions/eigenvectors, and "Fourier" coefficients of functions in the kernel space restricted to a discrete set of data points.

no code implementations • ICML 2018 • Siyuan Ma, Raef Bassily, Mikhail Belkin

We show that there is a critical batch size $m^*$ such that: (a) SGD iteration with mini-batch size $m\leq m^*$ is nearly equivalent to $m$ iterations of mini-batch size $1$ (\emph{linear scaling regime}).

no code implementations • 20 Jun 2017 • Justin Eldridge, Mikhail Belkin, Yusu Wang

Classical matrix perturbation results, such as Weyl's theorem for eigenvalues and the Davis-Kahan theorem for eigenvectors, are general purpose.

1 code implementation • NeurIPS 2017 • Siyuan Ma, Mikhail Belkin

An analysis based on the spectral properties of the kernel demonstrates that only a vanishingly small portion of the function space is reachable after a polynomial number of gradient descent iterations.

no code implementations • NeurIPS 2016 • Chaoyue Liu, Mikhail Belkin

Clustering, in particular $k$-means clustering, is a central topic in data analysis.

no code implementations • NeurIPS 2016 • Justin Eldridge, Mikhail Belkin, Yusu Wang

In this work we develop a theory of hierarchical clustering for graphs.

no code implementations • 10 Feb 2016 • Jihun Hamm, Paul Cao, Mikhail Belkin

How can we build an accurate and differentially private global classifier by combining locally-trained classifiers from different parties, without access to any party's private data?

no code implementations • 21 Jun 2015 • Justin Eldridge, Mikhail Belkin, Yusu Wang

In this paper we identify two limit properties, separation and minimality, which address both over-segmentation and improper nesting and together imply (but are not implied by) Hartigan consistency.

no code implementations • 27 Feb 2015 • Jihun Hamm, Mikhail Belkin

In this paper we propose a non-metric ranking-based representation of semantic similarity that allows natural aggregation of semantic information from multiple heterogeneous sources.

no code implementations • NeurIPS 2015 • James Voss, Mikhail Belkin, Luis Rademacher

We propose a new algorithm, PEGI (for pseudo-Euclidean Gradient Iteration), for provable model recovery for ICA with Gaussian noise.

no code implementations • 11 Jan 2015 • Jihun Hamm, Adam Champion, Guoxing Chen, Mikhail Belkin, Dong Xuan

Smart devices with built-in sensors, computational capabilities, and network connectivity have become increasingly pervasive.

no code implementations • NeurIPS 2014 • Qichao Que, Mikhail Belkin, Yusu Wang

In this paper we propose a framework for supervised and semi-supervised learning based on reformulating the learning problem as a regularized Fredholm integral equation.

no code implementations • 5 Nov 2014 • Mikhail Belkin, Luis Rademacher, James Voss

It includes influential Machine Learning methods such as cumulant-based FastICA and the tensor power iteration for orthogonally decomposable tensors as special cases.

1 code implementation • 4 Mar 2014 • James Voss, Mikhail Belkin, Luis Rademacher

Geometrically, the proposed algorithms can be interpreted as hidden basis recovery by means of function optimization.

no code implementations • NeurIPS 2013 • James R. Voss, Luis Rademacher, Mikhail Belkin

In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise.

no code implementations • 12 Nov 2013 • Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, James Voss

The problem of learning this map can be efficiently solved using some recent results on tensor decompositions and Independent Component Analysis (ICA), thus giving an algorithm for recovering the mixture.

no code implementations • NeurIPS 2013 • Qichao Que, Mikhail Belkin

In this paper we address the problem of estimating the ratio $\frac{q}{p}$ where $p$ is a density function and $q$ is another density, or, more generally an arbitrary function.

no code implementations • 7 Nov 2012 • Mikhail Belkin, Luis Rademacher, James Voss

In this paper we propose a new algorithm for solving the blind signal separation problem in the presence of additive Gaussian noise, when we are given samples from $X = AS + \eta$, where $\eta$ is drawn from an unknown, not necessarily spherical $n$-dimensional Gaussian distribution.

no code implementations • NeurIPS 2011 • Xiaoyin Ge, Issam I. Safa, Mikhail Belkin, Yusu Wang

While such data is often high-dimensional, it is of interest to approximate it with a low-dimensional or even one-dimensional space, since many important aspects of data are often intrinsically low-dimensional.

no code implementations • NeurIPS 2009 • Kaushik Sinha, Mikhail Belkin

We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices.
