Search Results for author: Mikhail Belkin

Found 68 papers, 14 papers with code

Average gradient outer product as a mechanism for deep neural collapse

no code implementations 21 Feb 2024 Daniel Beaglehole, Peter Súkeník, Marco Mondelli, Mikhail Belkin

In this work, we provide substantial evidence that DNC formation occurs primarily through deep feature learning with the average gradient outer product (AGOP).
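The AGOP referenced in this abstract is commonly defined as the expected outer product of input gradients, $\mathrm{AGOP}(f) = \mathbb{E}_x[\nabla_x f(x)\, \nabla_x f(x)^\top]$. A minimal NumPy sketch for a toy two-layer network (the architecture, dimensions, and sampling here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 5, 16, 200          # input dim, hidden width, number of samples

W1 = rng.normal(size=(h, d))  # first-layer weights
w2 = rng.normal(size=h)       # second-layer weights
X = rng.normal(size=(n, d))   # sample inputs

def input_gradient(x):
    """Gradient of f(x) = w2 . tanh(W1 x) with respect to the input x."""
    pre = W1 @ x
    return W1.T @ (w2 * (1.0 - np.tanh(pre) ** 2))

# Average gradient outer product over the sample.
agop = np.mean([np.outer(g, g) for g in (input_gradient(x) for x in X)], axis=0)

print(agop.shape)  # (5, 5)
```

Because the AGOP is an average of outer products, it is symmetric and positive semi-definite; its top eigenvectors indicate the input directions to which the model is most sensitive.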

Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination

1 code implementation 15 Feb 2024 Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić

Our results demonstrate the usefulness of this approach across different models and sizes, and also with parameter-efficient fine-tuning, offering a novel pathway to addressing the challenges with private and sensitive data in LLM applications.

Natural Language Understanding

Linear Recursive Feature Machines provably recover low-rank matrices

1 code implementation 9 Jan 2024 Adityanarayanan Radhakrishnan, Mikhail Belkin, Dmitriy Drusvyatskiy

A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction - a process called feature learning.

Dimensionality Reduction Low-Rank Matrix Completion +1

On the Nystrom Approximation for Preconditioning in Kernel Machines

no code implementations 6 Dec 2023 Amirhesam Abedsoltan, Parthe Pandit, Luis Rademacher, Mikhail Belkin

Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning.

Mechanism of feature learning in convolutional neural networks

1 code implementation 1 Sep 2023 Daniel Beaglehole, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin

We then demonstrate the generality of our result by using the patch-based AGOP to enable deep feature learning in convolutional kernel machines.

Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

no code implementations 7 Jun 2023 Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD).

On Emergence of Clean-Priority Learning in Early Stopped Neural Networks

no code implementations 5 Jun 2023 Chaoyue Liu, Amirhesam Abedsoltan, Mikhail Belkin

This behaviour is believed to be a result of neural networks learning the pattern of clean data first and fitting the noise later in the training, a phenomenon that we refer to as clean-priority learning.

Cut your Losses with Squentropy

no code implementations 8 Feb 2023 Like Hui, Mikhail Belkin, Stephen Wright

We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy.

Classification Multi-class Classification

Toward Large Kernel Models

1 code implementation 6 Feb 2023 Amirhesam Abedsoltan, Mikhail Belkin, Parthe Pandit

Recent studies indicate that kernel machines can often perform similarly or better than deep neural networks (DNNs) on small datasets.

Restricted Strong Convexity of Deep Learning Models with Smooth Activations

no code implementations 29 Sep 2022 Arindam Banerjee, Pedro Cisneros-Velarde, Libin Zhu, Mikhail Belkin

Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC) which holds as long as the squared norm of the average gradient of predictors is $\Omega(\frac{\text{poly}(L)}{\sqrt{m}})$ for the square loss.
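For context, restricted strong convexity requires the usual strong-convexity lower bound only over a restricted set $S$ of parameters. A standard textbook form of the condition (paraphrased here for reference, not quoted from the paper) is:

```latex
L(\theta') \;\ge\; L(\theta) + \langle \nabla L(\theta),\, \theta' - \theta \rangle
          + \frac{\alpha}{2}\,\lVert \theta' - \theta \rVert_2^2
\qquad \text{for all } \theta, \theta' \in S,\ \alpha > 0.
```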

A Universal Trade-off Between the Model Size, Test Loss, and Training Loss of Linear Predictors

no code implementations 23 Jul 2022 Nikhil Ghosh, Mikhail Belkin

Remarkably, while the Marchenko-Pastur analysis is far more precise near the interpolation peak, where the number of parameters is just enough to fit the training data, it coincides exactly with the distribution independent bound as the level of overparametrization increases.

Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting

no code implementations 14 Jul 2022 Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran

In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime.

Learning Theory

A note on Linear Bottleneck networks and their Transition to Multilinearity

no code implementations 30 Jun 2022 Libin Zhu, Parthe Pandit, Mikhail Belkin

In this work we show that linear networks with a bottleneck layer learn bilinear functions of the weights, in a ball of radius $O(1)$ around initialization.

On the Inconsistency of Kernel Ridgeless Regression in Fixed Dimensions

no code implementations 26 May 2022 Daniel Beaglehole, Mikhail Belkin, Parthe Pandit

"Benign overfitting", the ability of certain algorithms to interpolate noisy training data and yet perform well out-of-sample, has been a topic of considerable recent interest.

Regression

Quadratic models for understanding catapult dynamics of neural networks

1 code implementation 24 May 2022 Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models.

Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture

no code implementations 24 May 2022 Libin Zhu, Chaoyue Liu, Mikhail Belkin

In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity.

Wide and Deep Neural Networks Achieve Optimality for Classification

no code implementations 29 Apr 2022 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

In this work, we identify and construct an explicit set of neural network classifiers that achieve optimality.


Transition to Linearity of Wide Neural Networks is an Emerging Property of Assembling Weak Models

no code implementations ICLR 2022 Chaoyue Liu, Libin Zhu, Mikhail Belkin

Wide neural networks with linear output layer have been shown to be near-linear, and to have near-constant neural tangent kernel (NTK), in a region containing the optimization path of gradient descent.

Limitations of Neural Collapse for Understanding Generalization in Deep Learning

no code implementations 17 Feb 2022 Like Hui, Mikhail Belkin, Preetum Nakkiran

We refine the Neural Collapse conjecture into two separate conjectures: collapse on the train set (an optimization property) and collapse on the test distribution (a generalization property).

Representation Learning

Benign Overfitting in Two-layer Convolutional Neural Networks

no code implementations 14 Feb 2022 Yuan Cao, Zixiang Chen, Mikhail Belkin, Quanquan Gu

In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN).


Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive Step Size

no code implementations 30 Dec 2021 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Establishing a fast rate of convergence for optimization methods is crucial to their applicability in practice.

Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

1 code implementation 29 May 2021 Mikhail Belkin

In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges.

BIG-bench Machine Learning

On the linearity of large non-linear models: when and why the tangent kernel is constant

no code implementations NeurIPS 2020 Chaoyue Liu, Libin Zhu, Mikhail Belkin

We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width.

Linear Convergence and Implicit Regularization of Generalized Mirror Descent with Time-Dependent Mirrors

no code implementations 28 Sep 2020 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

The following questions are fundamental to understanding the properties of over-parameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum?

Linear Convergence of Generalized Mirror Descent with Time-Dependent Mirrors

no code implementations 18 Sep 2020 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

GMD subsumes popular first order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad.
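A generalized mirror descent step can be written as $\nabla\phi(w_{t+1}) = \nabla\phi(w_t) - \eta\, \nabla L(w_t)$ for a mirror map $\phi$; choosing $\phi(w) = \tfrac{1}{2}\lVert w\rVert^2$ recovers plain gradient descent, and $\phi(w) = \tfrac{1}{2} w^\top P w$ recovers preconditioned gradient descent. A small sketch verifying both reductions on a hypothetical quadratic objective (the objective and mirror maps are illustrative, not from the paper):

```python
import numpy as np

# Objective: L(w) = 0.5 * ||A w - b||^2, with gradient A^T (A w - b).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
grad = lambda w: A.T @ (A @ w - b)

def gmd_step(w, eta, mirror_grad, mirror_grad_inv):
    """One generalized mirror descent step: grad_phi(w_next) = grad_phi(w) - eta * grad L(w)."""
    return mirror_grad_inv(mirror_grad(w) - eta * grad(w))

w0 = np.array([0.5, 0.5])
eta = 0.1

# Mirror map phi(w) = 0.5 ||w||^2  =>  grad_phi = identity  =>  plain gradient descent.
identity = lambda w: w
w_gd = w0 - eta * grad(w0)
w_md = gmd_step(w0, eta, identity, identity)
print(np.allclose(w_gd, w_md))  # the two updates coincide

# Mirror map phi(w) = 0.5 w^T P w  =>  preconditioned step w - eta * P^{-1} grad L(w).
P = np.diag([4.0, 1.0])
w_pre = gmd_step(w0, eta, lambda w: P @ w, lambda z: np.linalg.solve(P, z))
```

Adagrad-style methods fit the same template with a time-dependent diagonal preconditioner, which is what "time-dependent mirrors" refers to in the title.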

Multiple Descent: Design Your Own Generalization Curve

no code implementations NeurIPS 2021 Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi

This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized.


Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks

no code implementations ICLR 2021 Like Hui, Mikhail Belkin

We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources.

Automatic Speech Recognition (ASR) +2

Loss landscapes and optimization in over-parameterized non-linear systems and neural networks

no code implementations 29 Feb 2020 Chaoyue Liu, Libin Zhu, Mikhail Belkin

The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks.

Overparameterized Neural Networks Implement Associative Memory

1 code implementation 26 Sep 2019 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Identifying computational mechanisms for memorization and retrieval of data is a long-standing problem at the intersection of machine learning and neuroscience.

Memorization Retrieval

Overparameterized Neural Networks Can Implement Associative Memory

no code implementations 25 Sep 2019 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Identifying computational mechanisms for memorization and retrieval is a long-standing problem at the intersection of machine learning and neuroscience.

Memorization Retrieval

Downsampling leads to Image Memorization in Convolutional Autoencoders

no code implementations ICLR 2019 Adityanarayanan Radhakrishnan, Caroline Uhler, Mikhail Belkin

In this paper, we link memorization of images in deep convolutional autoencoders to downsampling through strided convolution.


Two models of double descent for weak features

no code implementations 18 Mar 2019 Mikhail Belkin, Daniel Hsu, Ji Xu

The "double descent" risk curve was proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models.

BIG-bench Machine Learning

Reconciling modern machine learning practice and the bias-variance trade-off

2 code implementations 28 Dec 2018 Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal

This connection between the performance and the structure of machine learning models delineates the limits of classical analyses, and has implications for both the theory and practice of machine learning.

BIG-bench Machine Learning

On exponential convergence of SGD in non-convex over-parametrized learning

no code implementations 6 Nov 2018 Raef Bassily, Mikhail Belkin, Siyuan Ma

Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning.

BIG-bench Machine Learning

Accelerating SGD with momentum for over-parameterized learning

1 code implementation ICLR 2020 Chaoyue Liu, Mikhail Belkin

This is in contrast to the classical results in the deterministic scenario, where the same step size ensures accelerated convergence of Nesterov's method over optimal gradient descent.

Memorization in Overparameterized Autoencoders

no code implementations ICML Workshop Deep_Phenomen 2019 Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, Caroline Uhler

The ability of deep neural networks to generalize well in the overparameterized regime has become a subject of significant research interest.

Inductive Bias Memorization

Does data interpolation contradict statistical optimality?

no code implementations 25 Jun 2018 Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov

We show that learning methods interpolating the training data can achieve optimal rates for the problems of nonparametric regression and prediction with square loss.


Kernel machines that adapt to GPUs for effective large batch training

2 code implementations 15 Jun 2018 Siyuan Ma, Mikhail Belkin

In this paper we develop the first analytical framework that extends linear scaling to match the parallel computing capacity of a resource.

Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

no code implementations NeurIPS 2018 Mikhail Belkin, Daniel Hsu, Partha Mitra

Finally, this paper suggests a way to explain the phenomenon of adversarial examples, which are seemingly ubiquitous in modern machine learning, and also discusses some connections to kernel machines and random forests in the interpolated regime.

BIG-bench Machine Learning General Classification +1

Parametrized Accelerated Methods Free of Condition Number

no code implementations 28 Feb 2018 Chaoyue Liu, Mikhail Belkin

Analyses of accelerated (momentum-based) gradient descent usually assume bounded condition number to obtain exponential convergence rates.

Fast Interactive Image Retrieval using large-scale unlabeled data

no code implementations 12 Feb 2018 Akshay Mehra, Jihun Hamm, Mikhail Belkin

Active learning reduces the number of user interactions by querying the labels of the most informative points, and GSSL makes it possible to use abundant unlabeled data along with the limited labeled data provided by the user.

Active Learning Binary Classification +2

To understand deep learning we need to understand kernel learning

no code implementations ICML 2018 Mikhail Belkin, Siyuan Ma, Soumik Mandal

Certain key phenomena of deep learning are manifested similarly in kernel methods in the modern "overfitted" regime.

Generalization Bounds

Approximation beats concentration? An approximation view on inference with smooth radial kernels

no code implementations 10 Jan 2018 Mikhail Belkin

We analyze eigenvalue decay of kernel operators and matrices, properties of eigenfunctions/eigenvectors and "Fourier" coefficients of functions in the kernel space restricted to a discrete set of data points.
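The fast eigenvalue decay of smooth radial kernels is easy to observe numerically: the spectrum of a Gaussian kernel matrix on random data drops by orders of magnitude within a few dozen eigenvalues. A generic illustration (the data, dimension, and bandwidth are arbitrary choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))        # 200 points in 2D

# Gaussian (RBF) kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 s^2)), s = 0.5.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * 0.5 ** 2))

eigvals = np.linalg.eigvalsh(K)[::-1]        # sorted in descending order
print(eigvals[0] / eigvals[49])              # ratio spans orders of magnitude
```

This rapid decay is what limits the portion of the function space reachable by a polynomial number of gradient descent iterations in the abstract above.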

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

no code implementations ICML 2018 Siyuan Ma, Raef Bassily, Mikhail Belkin

We show that there is a critical batch size $m^*$ such that: (a) SGD iteration with mini-batch size $m \leq m^*$ is nearly equivalent to $m$ iterations of mini-batch size $1$ (the linear scaling regime).
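The linear scaling regime can be illustrated by running minibatch SGD on an overparameterized least-squares problem with the step size scaled linearly in the batch size; both batch sizes drive the training loss to (near) zero. This is only a toy sketch under assumed dimensions and step sizes, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                 # fewer samples than parameters: interpolation is possible
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star                # noiseless targets, so zero training loss is attainable

def sgd(batch_size, base_lr=0.01, epochs=200):
    """Minibatch SGD on 0.5*||Xw - y||^2/n with lr scaled linearly in batch size."""
    w = np.zeros(d)
    lr = base_lr * batch_size
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            B = idx[start:start + batch_size]
            g = X[B].T @ (X[B] @ w - y[B]) / len(B)
            w -= lr * g
    return 0.5 * np.mean((X @ w - y) ** 2)

loss_b1 = sgd(batch_size=1)
loss_b4 = sgd(batch_size=4)
print(loss_b1, loss_b4)       # both training losses end up near zero
```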

Unperturbed: spectral analysis beyond Davis-Kahan

no code implementations 20 Jun 2017 Justin Eldridge, Mikhail Belkin, Yusu Wang

Classical matrix perturbation results, such as Weyl's theorem for eigenvalues and the Davis-Kahan theorem for eigenvectors, are general purpose.


Diving into the shallows: a computational perspective on large-scale shallow learning

1 code implementation NeurIPS 2017 Siyuan Ma, Mikhail Belkin

An analysis based on the spectral properties of the kernel demonstrates that only a vanishingly small portion of the function space is reachable after a polynomial number of gradient descent iterations.

Learning Privately from Multiparty Data

no code implementations 10 Feb 2016 Jihun Hamm, Paul Cao, Mikhail Belkin

How can we build an accurate and differentially private global classifier by combining locally-trained classifiers from different parties, without access to any party's private data?

Activity Recognition Network Intrusion Detection

Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering

no code implementations 21 Jun 2015 Justin Eldridge, Mikhail Belkin, Yusu Wang

In this paper we identify two limit properties, separation and minimality, which address both over-segmentation and improper nesting and together imply (but are not implied by) Hartigan consistency.


Probabilistic Zero-shot Classification with Semantic Rankings

no code implementations 27 Feb 2015 Jihun Hamm, Mikhail Belkin

In this paper we propose a non-metric ranking-based representation of semantic similarity that allows natural aggregation of semantic information from multiple heterogeneous sources.

Classification General Classification +3

A Pseudo-Euclidean Iteration for Optimal Recovery in Noisy ICA

no code implementations NeurIPS 2015 James Voss, Mikhail Belkin, Luis Rademacher

We propose a new algorithm, PEGI (for pseudo-Euclidean Gradient Iteration), for provable model recovery for ICA with Gaussian noise.

Crowd-ML: A Privacy-Preserving Learning Framework for a Crowd of Smart Devices

no code implementations 11 Jan 2015 Jihun Hamm, Adam Champion, Guoxing Chen, Mikhail Belkin, Dong Xuan

Smart devices with built-in sensors, computational capabilities, and network connectivity have become increasingly pervasive.

Privacy Preserving

Learning with Fredholm Kernels

no code implementations NeurIPS 2014 Qichao Que, Mikhail Belkin, Yusu Wang

In this paper we propose a framework for supervised and semi-supervised learning based on reformulating the learning problem as a regularized Fredholm integral equation.

Eigenvectors of Orthogonally Decomposable Functions

no code implementations 5 Nov 2014 Mikhail Belkin, Luis Rademacher, James Voss

It includes influential Machine Learning methods such as cumulant-based FastICA and the tensor power iteration for orthogonally decomposable tensors as special cases.

Clustering Topic Models

The Hidden Convexity of Spectral Clustering

1 code implementation 4 Mar 2014 James Voss, Mikhail Belkin, Luis Rademacher

Geometrically, the proposed algorithms can be interpreted as hidden basis recovery by means of function optimization.


Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis

no code implementations NeurIPS 2013 James R. Voss, Luis Rademacher, Mikhail Belkin

In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise.

The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

no code implementations 12 Nov 2013 Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, James Voss

The problem of learning this map can be efficiently solved using some recent results on tensor decompositions and Independent Component Analysis (ICA), thus giving an algorithm for recovering the mixture.

Inverse Density as an Inverse Problem: The Fredholm Equation Approach

no code implementations NeurIPS 2013 Qichao Que, Mikhail Belkin

In this paper we address the problem of estimating the ratio $\frac{q}{p}$ where $p$ is a density function and $q$ is another density, or, more generally an arbitrary function.
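A naive baseline for the ratio $\frac{q}{p}$ is to estimate both densities separately and divide; the instability of that plug-in estimate, especially in low-density regions, is part of what motivates solving for the ratio directly. A quick NumPy sketch of the baseline (the distributions, bandwidth, and evaluation point are illustrative assumptions, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)
xs_p = rng.normal(0.0, 1.0, size=2000)   # samples from p = N(0, 1)
xs_q = rng.normal(0.5, 1.0, size=2000)   # samples from q = N(0.5, 1)

def kde(samples, x, bandwidth=0.3):
    """Gaussian kernel density estimate of the sample's density at point x."""
    z = (x - samples) / bandwidth
    return np.mean(np.exp(-0.5 * z ** 2)) / (bandwidth * np.sqrt(2 * np.pi))

# Plug-in estimate of q/p at a point: divide two noisy density estimates.
x0 = 0.0
ratio = kde(xs_q, x0) / kde(xs_p, x0)
print(ratio)   # true ratio q(0)/p(0) = exp(-0.125), about 0.88
```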

Transfer Learning

Blind Signal Separation in the Presence of Gaussian Noise

no code implementations 7 Nov 2012 Mikhail Belkin, Luis Rademacher, James Voss

In this paper we propose a new algorithm for solving the blind signal separation problem in the presence of additive Gaussian noise, when we are given samples from $X = AS + \eta$, where $\eta$ is drawn from an unknown, not necessarily spherical $n$-dimensional Gaussian distribution.

Data Skeletonization via Reeb Graphs

no code implementations NeurIPS 2011 Xiaoyin Ge, Issam I. Safa, Mikhail Belkin, Yusu Wang

While such data is often high-dimensional, it is of interest to approximate it with a low-dimensional or even one-dimensional space, since many important aspects of data are often intrinsically low-dimensional.

Semi-supervised Learning using Sparse Eigenfunction Bases

no code implementations NeurIPS 2009 Kaushik Sinha, Mikhail Belkin

We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices.
