Search Results for author: Mikhail Belkin

Found 68 papers, 14 papers with code

Semi-supervised Learning using Sparse Eigenfunction Bases

no code implementations NeurIPS 2009 Kaushik Sinha, Mikhail Belkin

We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices.
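The snippet does not spell out the construction, but the general idea can be illustrated with a small sketch: build a kernel matrix over labeled and unlabeled points, take its leading eigenvectors as a basis, and fit a sparse combination of them using only the labeled points. This is a rough illustration with made-up data and parameters, not the authors' exact algorithm.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical two-moons data; only a handful of points are labeled.
n = 400
t = rng.uniform(0, np.pi, n)
X = np.vstack([np.c_[np.cos(t[:200]), np.sin(t[:200])],
               np.c_[1 - np.cos(t[200:]), 0.5 - np.sin(t[200:])]])
y = np.r_[np.ones(200), -np.ones(200)]
labeled = rng.choice(n, size=10, replace=False)

# Gaussian kernel matrix over labeled + unlabeled points.
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dist / (2 * 0.3 ** 2))

# "Eigenfunction" basis: leading eigenvectors of the kernel matrix.
_, vecs = np.linalg.eigh(K)
basis = vecs[:, -50:]                          # 50 leading eigenvectors

# Sparse expansion in that basis, fitted on the labeled subset only.
coef = Lasso(alpha=1e-3, max_iter=10000).fit(basis[labeled], y[labeled]).coef_
pred = np.sign(basis @ coef)                   # predicted labels for all points
print("accuracy on labeled points:", (pred[labeled] == y[labeled]).mean())
print("accuracy on all points:    ", (pred == y).mean())
```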

Data Skeletonization via Reeb Graphs

no code implementations NeurIPS 2011 Xiaoyin Ge, Issam I. Safa, Mikhail Belkin, Yusu Wang

While such data is often high-dimensional, it is of interest to approximate it with a low-dimensional or even one-dimensional space, since many important aspects of data are often intrinsically low-dimensional.

Blind Signal Separation in the Presence of Gaussian Noise

no code implementations7 Nov 2012 Mikhail Belkin, Luis Rademacher, James Voss

In this paper we propose a new algorithm for solving the blind signal separation problem in the presence of additive Gaussian noise, when we are given samples from $X = AS + \eta$, where $\eta$ is drawn from an unknown, not necessarily spherical $n$-dimensional Gaussian distribution.
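A toy sketch of the data model from the abstract, with made-up dimensions and mixing matrix, plus a numerical check of the standard fact that noise-invariant ICA methods exploit: additive Gaussian noise leaves higher-order cumulants unchanged. This is not the paper's algorithm, only the setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sources, n_samples = 3, 100_000

# Independent non-Gaussian sources (uniform), mixed by an unknown matrix A.
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n_sources, n_samples))
A = rng.normal(size=(n_sources, n_sources))

# Additive Gaussian noise with an arbitrary (non-spherical) covariance.
L = rng.normal(size=(n_sources, n_sources))
eta = rng.multivariate_normal(np.zeros(n_sources), L @ L.T, size=n_samples).T

X = A @ S + eta                                # observed signals: X = AS + eta

# The invariance such algorithms exploit: Gaussian noise does not change
# fourth cumulants, kappa4(x) = E[x^4] - 3 E[x^2]^2 for zero-mean x.
def kappa4(x):
    x = x - x.mean()
    return np.mean(x ** 4) - 3 * np.mean(x ** 2) ** 2

clean = (A @ S)[0]
print(kappa4(clean), kappa4(X[0]))             # close, despite the added noise
```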

Inverse Density as an Inverse Problem: The Fredholm Equation Approach

no code implementations NeurIPS 2013 Qichao Que, Mikhail Belkin

In this paper we address the problem of estimating the ratio $\frac{q}{p}$ where $p$ is a density function and $q$ is another density, or, more generally, an arbitrary function.

Transfer Learning
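A crude discretized sketch of the Fredholm-equation viewpoint, not the paper's RKHS formulation: since $\int k(x,y)\,\frac{q}{p}(y)\,p(y)\,dy = \int k(x,y)\,q(y)\,dy$, one can replace both integrals by sample averages and solve a regularized linear system for the ratio at the $p$-samples. The data, kernel bandwidth, and regularization below are made up and untuned.

```python
import numpy as np

rng = np.random.default_rng(1)
xp = rng.normal(0.0, 1.0, 500)                 # samples from p = N(0, 1)
xq = rng.normal(1.0, 0.5, 500)                 # samples from q = N(1, 0.5^2)

def gauss_k(a, b, h=0.3):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h ** 2))

# Fredholm equation: int k(x,y) (q/p)(y) p(y) dy = int k(x,y) q(y) dy.
# Replace both integrals by sample averages and solve a regularized linear
# system for the values of the ratio q/p at the p-samples.
Kpp = gauss_k(xp, xp)
Kpq = gauss_k(xp, xq)
lam = 1e-2                                     # untuned regularization
ratio_hat = np.linalg.solve(Kpp / len(xp) + lam * np.eye(len(xp)),
                            Kpq.mean(axis=1))

def true_ratio(x):                             # known in this synthetic example
    p = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
    q = np.exp(-(x - 1) ** 2 / (2 * 0.25)) / np.sqrt(2 * np.pi * 0.25)
    return q / p

print("median abs error:", np.median(np.abs(ratio_hat - true_ratio(xp))))
```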

The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

no code implementations12 Nov 2013 Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, James Voss

The problem of learning this map can be efficiently solved using some recent results on tensor decompositions and Independent Component Analysis (ICA), thus giving an algorithm for recovering the mixture.

Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis

no code implementations NeurIPS 2013 James R. Voss, Luis Rademacher, Mikhail Belkin

In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise.

The Hidden Convexity of Spectral Clustering

1 code implementation4 Mar 2014 James Voss, Mikhail Belkin, Luis Rademacher

Geometrically, the proposed algorithms can be interpreted as hidden basis recovery by means of function optimization.

Clustering

Eigenvectors of Orthogonally Decomposable Functions

no code implementations5 Nov 2014 Mikhail Belkin, Luis Rademacher, James Voss

It includes influential Machine Learning methods such as cumulant-based FastICA and the tensor power iteration for orthogonally decomposable tensors as special cases.

Clustering Topic Models

Learning with Fredholm Kernels

no code implementations NeurIPS 2014 Qichao Que, Mikhail Belkin, Yusu Wang

In this paper we propose a framework for supervised and semi-supervised learning based on reformulating the learning problem as a regularized Fredholm integral equation.

Crowd-ML: A Privacy-Preserving Learning Framework for a Crowd of Smart Devices

no code implementations11 Jan 2015 Jihun Hamm, Adam Champion, Guoxing Chen, Mikhail Belkin, Dong Xuan

Smart devices with built-in sensors, computational capabilities, and network connectivity have become increasingly pervasive.

Privacy Preserving

A Pseudo-Euclidean Iteration for Optimal Recovery in Noisy ICA

no code implementations NeurIPS 2015 James Voss, Mikhail Belkin, Luis Rademacher

We propose a new algorithm, PEGI (for pseudo-Euclidean Gradient Iteration), for provable model recovery for ICA with Gaussian noise.

Probabilistic Zero-shot Classification with Semantic Rankings

no code implementations27 Feb 2015 Jihun Hamm, Mikhail Belkin

In this paper we propose a non-metric ranking-based representation of semantic similarity that allows natural aggregation of semantic information from multiple heterogeneous sources.

Classification General Classification +3

Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering

no code implementations21 Jun 2015 Justin Eldridge, Mikhail Belkin, Yusu Wang

In this paper we identify two limit properties, separation and minimality, which address both over-segmentation and improper nesting and together imply (but are not implied by) Hartigan consistency.

Clustering

Learning Privately from Multiparty Data

no code implementations10 Feb 2016 Jihun Hamm, Paul Cao, Mikhail Belkin

How can we build an accurate and differentially private global classifier by combining locally-trained classifiers from different parties, without access to any party's private data?

Activity Recognition Network Intrusion Detection
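The abstract does not describe the aggregation mechanism, so the sketch below is a generic illustration of the setting rather than this paper's method: each party trains a classifier on its own data, and predictions are combined through a Laplace-noised vote so that only perturbed vote counts, never raw data, leave a party. All names (e.g. `make_party_data`, `private_vote`) and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_parties, n_per_party, d = 20, 200, 5
w_true = rng.normal(size=d)

def make_party_data():
    X = rng.normal(size=(n_per_party, d))
    y = np.sign(X @ w_true + 0.5 * rng.normal(size=n_per_party))
    return X, y

def train_local(X, y):
    # least-squares fit as a stand-in for each party's locally trained classifier
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda Xq: np.sign(Xq @ w)

local_models = [train_local(*make_party_data()) for _ in range(n_parties)]

def private_vote(Xq, epsilon=1.0):
    # Only Laplace-noised vote counts leave the parties, never the raw data.
    votes = np.stack([m(Xq) for m in local_models])        # parties x queries
    pos = (votes > 0).sum(axis=0) + rng.laplace(scale=1 / epsilon, size=Xq.shape[0])
    neg = (votes < 0).sum(axis=0) + rng.laplace(scale=1 / epsilon, size=Xq.shape[0])
    return np.where(pos >= neg, 1.0, -1.0)

X_query = rng.normal(size=(500, d))
y_query = np.sign(X_query @ w_true)
print("accuracy of the noisy-vote ensemble:", (private_vote(X_query) == y_query).mean())
```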

Diving into the shallows: a computational perspective on large-scale shallow learning

1 code implementation NeurIPS 2017 Siyuan Ma, Mikhail Belkin

An analysis based on the spectral properties of the kernel demonstrates that only a vanishingly small portion of the function space is reachable after a polynomial number of gradient descent iterations.

Unperturbed: spectral analysis beyond Davis-Kahan

no code implementations20 Jun 2017 Justin Eldridge, Mikhail Belkin, Yusu Wang

Classical matrix perturbation results, such as Weyl's theorem for eigenvalues and the Davis-Kahan theorem for eigenvectors, are general purpose.

Clustering

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

no code implementations ICML 2018 Siyuan Ma, Raef Bassily, Mikhail Belkin

We show that there is a critical batch size $m^*$ such that: (a) SGD iteration with mini-batch size $m\leq m^*$ is nearly equivalent to $m$ iterations of mini-batch size $1$ (the linear scaling regime).
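A toy illustration of the linear scaling regime described above, with made-up data: on an over-parametrized (interpolating) least-squares problem, one step of mini-batch size $m$ at learning rate $m\cdot\eta$ behaves roughly like $m$ steps of mini-batch size $1$ at learning rate $\eta$. The critical batch size $m^*$ depends on the covariance spectrum and is not computed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 400                        # over-parametrized: interpolation is possible
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = X @ rng.normal(size=d)             # realizable targets

def sgd(batch, lr, n_updates):
    w = np.zeros(d)
    for _ in range(n_updates):
        idx = rng.integers(0, n, size=batch)
        w -= lr * X[idx].T @ (X[idx] @ w - y[idx]) / batch
    return 0.5 * np.mean((X @ w - y) ** 2)

base_lr, m = 0.5, 8
# In the linear scaling regime, one step with mini-batch size m at learning rate
# m * base_lr behaves roughly like m steps with mini-batch size 1 at base_lr.
print("batch 1, lr 0.5, 4000 updates:", sgd(1, base_lr, 4000))
print("batch 8, lr 4.0,  500 updates:", sgd(m, m * base_lr, 4000 // m))
```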

Approximation beats concentration? An approximation view on inference with smooth radial kernels

no code implementations10 Jan 2018 Mikhail Belkin

We analyze eigenvalue decay of kernel operators and matrices, properties of eigenfunctions/eigenvectors and "Fourier" coefficients of functions in the kernel space restricted to a discrete set of data points.

To understand deep learning we need to understand kernel learning

no code implementations ICML 2018 Mikhail Belkin, Siyuan Ma, Soumik Mandal

Certain key phenomena of deep learning are manifested similarly in kernel methods in the modern "overfitted" regime.

Generalization Bounds

Fast Interactive Image Retrieval using large-scale unlabeled data

no code implementations12 Feb 2018 Akshay Mehra, Jihun Hamm, Mikhail Belkin

Active learning reduces the number of user interactions by querying the labels of the most informative points, and GSSL allows the use of abundant unlabeled data along with the limited labeled data provided by the user.

Active Learning Binary Classification +2

Parametrized Accelerated Methods Free of Condition Number

no code implementations28 Feb 2018 Chaoyue Liu, Mikhail Belkin

Analyses of accelerated (momentum-based) gradient descent usually assume a bounded condition number to obtain exponential convergence rates.

Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

no code implementations NeurIPS 2018 Mikhail Belkin, Daniel Hsu, Partha Mitra

Finally, this paper suggests a way to explain the phenomenon of adversarial examples, which are seemingly ubiquitous in modern machine learning, and also discusses some connections to kernel machines and random forests in the interpolated regime.

BIG-bench Machine Learning General Classification +1

Kernel machines that adapt to GPUs for effective large batch training

2 code implementations15 Jun 2018 Siyuan Ma, Mikhail Belkin

In this paper we develop the first analytical framework that extends linear scaling to match the parallel computing capacity of a resource.

Does data interpolation contradict statistical optimality?

no code implementations25 Jun 2018 Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov

We show that learning methods interpolating the training data can achieve optimal rates for the problems of nonparametric regression and prediction with square loss.

regression
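One concrete example in this spirit, with illustrative data and parameters, is Nadaraya-Watson regression with a singular kernel $K(u)\propto |u|^{-a}$, $0<a<d/2$: the estimator passes through every noisy training label, yet it still averages enough neighbors to predict the underlying function well. The sketch below is a minimal 1-D demonstration, not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = np.sort(rng.uniform(0, 1, n))
f_true = lambda t: np.sin(2 * np.pi * t)
y = f_true(x) + 0.3 * rng.normal(size=n)       # noisy labels

def singular_nw(t, a=0.49, h=0.1):
    """Nadaraya-Watson with a singular kernel K(u) ~ |u|^(-a) on a local window."""
    t = np.atleast_1d(t)
    out = np.empty(len(t))
    for j, tj in enumerate(t):
        d = np.abs(tj - x)
        if d.min() == 0.0:                     # exactly at a sample: return its label
            out[j] = y[d.argmin()]
            continue
        w = np.where(d <= h, d ** (-a), 0.0)   # weights blow up near samples
        out[j] = (w @ y) / w.sum()
    return out

grid = np.linspace(0.01, 0.99, 300)
print("max error on training points:", np.max(np.abs(singular_nw(x) - y)))
print("mean squared error vs true f:", np.mean((singular_nw(grid) - f_true(grid)) ** 2))
```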

Memorization in Overparameterized Autoencoders

no code implementations ICML Workshop Deep Phenomena 2019 Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, Caroline Uhler

The ability of deep neural networks to generalize well in the overparameterized regime has become a subject of significant research interest.

Inductive Bias Memorization

Accelerating SGD with momentum for over-parameterized learning

1 code implementation ICLR 2020 Chaoyue Liu, Mikhail Belkin

This is in contrast to the classical results in the deterministic scenario, where the same step size ensures accelerated convergence of Nesterov's method over optimal gradient descent.

On exponential convergence of SGD in non-convex over-parametrized learning

no code implementations6 Nov 2018 Raef Bassily, Mikhail Belkin, Siyuan Ma

Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning.

BIG-bench Machine Learning

Reconciling modern machine learning practice and the bias-variance trade-off

3 code implementations28 Dec 2018 Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal

This connection between the performance and the structure of machine learning models delineates the limits of classical analyses, and has implications for both the theory and practice of machine learning.

BIG-bench Machine Learning
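The double-descent curve can be reproduced on a toy random-features problem; the setup and parameters below are illustrative and not the paper's experiments. Test error of the minimum-norm fit typically peaks when the number of random features roughly equals the number of training points and then falls again in the over-parametrized regime.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d = 100, 2000, 10
w_true = rng.normal(size=d)

def make_data(m):
    X = rng.normal(size=(m, d))
    return X, X @ w_true + 0.5 * rng.normal(size=m)     # noisy linear targets

X_tr, y_tr = make_data(n)
X_te, y_te = make_data(n_test)

# Fixed random ReLU features; least-squares / minimum-norm fit on top of them.
V = rng.normal(size=(d, 2000)) / np.sqrt(d)
feats = lambda X, N: np.maximum(X @ V[:, :N], 0.0)

print(" #features   test MSE")
for N in [5, 20, 50, 90, 100, 110, 200, 500, 2000]:
    a = np.linalg.pinv(feats(X_tr, N)) @ y_tr           # min-norm solution when N >= n
    mse = np.mean((feats(X_te, N) @ a - y_te) ** 2)
    print(f"{N:10d}   {mse:10.3f}")
```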

Two models of double descent for weak features

no code implementations18 Mar 2019 Mikhail Belkin, Daniel Hsu, Ji Xu

The "double descent" risk curve was proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models.

BIG-bench Machine Learning Vocal Bursts Valence Prediction

Downsampling leads to Image Memorization in Convolutional Autoencoders

no code implementations ICLR 2019 Adityanarayanan Radhakrishnan, Caroline Uhler, Mikhail Belkin

In this paper, we link memorization of images in deep convolutional autoencoders to downsampling through strided convolution.

Memorization

Overparameterized Neural Networks Can Implement Associative Memory

no code implementations25 Sep 2019 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Identifying computational mechanisms for memorization and retrieval is a long-standing problem at the intersection of machine learning and neuroscience.

Memorization Retrieval

Overparameterized Neural Networks Implement Associative Memory

1 code implementation26 Sep 2019 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Identifying computational mechanisms for memorization and retrieval of data is a long-standing problem at the intersection of machine learning and neuroscience.

Memorization Retrieval

Loss landscapes and optimization in over-parameterized non-linear systems and neural networks

no code implementations29 Feb 2020 Chaoyue Liu, Libin Zhu, Mikhail Belkin

The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks.

Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks

no code implementations ICLR 2021 Like Hui, Mikhail Belkin

We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
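A stripped-down version of the comparison, on synthetic data with a linear model rather than the deep architectures studied in the paper: training the same predictor with cross entropy and with the square loss on one-hot targets typically yields comparable accuracy. All sizes and learning rates are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, C = 2000, 20, 5
W_true = rng.normal(size=(d, C))

def make(m):
    X = rng.normal(size=(m, d))
    return X, np.argmax(X @ W_true + rng.normal(size=(m, C)), axis=1)

X_tr, y_tr = make(n)
X_te, y_te = make(5000)
Y_onehot = np.eye(C)[y_tr]

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train(loss, lr=0.1, steps=2000):
    W = np.zeros((d, C))
    for _ in range(steps):
        Z = X_tr @ W
        if loss == "cross_entropy":
            G = X_tr.T @ (softmax(Z) - Y_onehot) / n
        else:                                  # square loss on logits vs one-hot targets
            G = 2 * X_tr.T @ (Z - Y_onehot) / n
        W -= lr * G
    return W

for loss in ["cross_entropy", "square"]:
    acc = (np.argmax(X_te @ train(loss), axis=1) == y_te).mean()
    print(f"{loss:14s} test accuracy: {acc:.3f}")
```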

Multiple Descent: Design Your Own Generalization Curve

no code implementations NeurIPS 2021 Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi

This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized.

regression

Linear Convergence of Generalized Mirror Descent with Time-Dependent Mirrors

no code implementations18 Sep 2020 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

GMD subsumes popular first order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad.
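A minimal sketch of a generalized mirror descent step with a time-dependent quadratic mirror, on made-up data and with untuned step sizes: with the identity mirror it reduces to plain gradient descent, while an Adagrad-style diagonal accumulator gives a preconditioned, time-dependent variant. This illustrates the update rule, not the paper's convergence analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
grad = lambda w: X.T @ (X @ w - y) / n

# Generalized mirror descent with potential psi_t(w) = 0.5 * w^T D_t w:
#     grad psi_t(w_{t+1}) = grad psi_t(w_t) - eta * grad L(w_t),
# i.e. a preconditioned step w_{t+1} = w_t - eta * D_t^{-1} grad L(w_t).
# D_t = I recovers gradient descent; an Adagrad-style accumulator gives a
# time-dependent mirror.
def gmd(steps=500, eta=0.5, adaptive=True):
    w, acc = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        g = grad(w)
        acc += g ** 2
        D = np.sqrt(acc) + 1e-8 if adaptive else np.ones(d)  # diagonal mirror Hessian
        w = w - eta * g / D
    return 0.5 * np.mean((X @ w - y) ** 2)

print("identity mirror (gradient descent):", gmd(adaptive=False))
print("Adagrad-style time-dependent mirror:", gmd(adaptive=True))
```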

Linear Convergence and Implicit Regularization of Generalized Mirror Descent with Time-Dependent Mirrors

no code implementations28 Sep 2020 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

The following questions are fundamental to understanding the properties of over-parameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum?

On the linearity of large non-linear models: when and why the tangent kernel is constant

no code implementations NeurIPS 2020 Chaoyue Liu, Libin Zhu, Mikhail Belkin

We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width.

Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

1 code implementation29 May 2021 Mikhail Belkin

In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges.

BIG-bench Machine Learning

Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive Step Size

no code implementations30 Dec 2021 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Establishing a fast rate of convergence for optimization methods is crucial to their applicability in practice.

Benign Overfitting in Two-layer Convolutional Neural Networks

no code implementations14 Feb 2022 Yuan Cao, Zixiang Chen, Mikhail Belkin, Quanquan Gu

In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN).

Vocal Bursts Valence Prediction

Limitations of Neural Collapse for Understanding Generalization in Deep Learning

no code implementations17 Feb 2022 Like Hui, Mikhail Belkin, Preetum Nakkiran

We refine the Neural Collapse conjecture into two separate conjectures: collapse on the train set (an optimization property) and collapse on the test distribution (a generalization property).

Representation Learning

Transition to Linearity of Wide Neural Networks is an Emerging Property of Assembling Weak Models

no code implementations ICLR 2022 Chaoyue Liu, Libin Zhu, Mikhail Belkin

Wide neural networks with linear output layer have been shown to be near-linear, and to have near-constant neural tangent kernel (NTK), in a region containing the optimization path of gradient descent.

Wide and Deep Neural Networks Achieve Optimality for Classification

no code implementations29 Apr 2022 Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

In this work, we identify and construct an explicit set of neural network classifiers that achieve optimality.

Classification

Quadratic models for understanding neural network dynamics

1 code implementation24 May 2022 Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models.

Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture

no code implementations24 May 2022 Libin Zhu, Chaoyue Liu, Mikhail Belkin

In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity.

On the Inconsistency of Kernel Ridgeless Regression in Fixed Dimensions

no code implementations26 May 2022 Daniel Beaglehole, Mikhail Belkin, Parthe Pandit

"Benign overfitting", the ability of certain algorithms to interpolate noisy training data and yet perform well out-of-sample, has been a topic of considerable recent interest.

regression Translation

A note on Linear Bottleneck networks and their Transition to Multilinearity

no code implementations30 Jun 2022 Libin Zhu, Parthe Pandit, Mikhail Belkin

In this work we show that linear networks with a bottleneck layer learn bilinear functions of the weights, in a ball of radius $O(1)$ around initialization.

Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting

no code implementations14 Jul 2022 Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran

In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime.

Learning Theory

A Universal Trade-off Between the Model Size, Test Loss, and Training Loss of Linear Predictors

no code implementations23 Jul 2022 Nikhil Ghosh, Mikhail Belkin

Remarkably, while the Marchenko-Pastur analysis is far more precise near the interpolation peak, where the number of parameters is just enough to fit the training data, it coincides exactly with the distribution independent bound as the level of overparametrization increases.

Restricted Strong Convexity of Deep Learning Models with Smooth Activations

no code implementations29 Sep 2022 Arindam Banerjee, Pedro Cisneros-Velarde, Libin Zhu, Mikhail Belkin

Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC) which holds as long as the squared norm of the average gradient of predictors is $\Omega(\frac{\text{poly}(L)}{\sqrt{m}})$ for the square loss.

Toward Large Kernel Models

1 code implementation6 Feb 2023 Amirhesam Abedsoltan, Mikhail Belkin, Parthe Pandit

Recent studies indicate that kernel machines can often perform similarly or better than deep neural networks (DNNs) on small datasets.

Cut your Losses with Squentropy

no code implementations8 Feb 2023 Like Hui, Mikhail Belkin, Stephen Wright

We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy.

Classification Multi-class Classification
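The snippet does not define the loss; the sketch below encodes one plausible reading of the title and abstract, cross entropy plus the average squared value of the non-true-class logits, and may differ in details from the paper's exact definition.

```python
import numpy as np

def squentropy_like(logits, y):
    """Cross entropy plus the mean squared non-true-class logit.

    One plausible reading of a cross-entropy / rescaled-square hybrid; it is
    not guaranteed to match the paper's definition exactly.
    """
    n, C = logits.shape
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(n), y].mean()
    mask = np.ones_like(logits, dtype=bool)
    mask[np.arange(n), y] = False
    sq = (logits[mask] ** 2).reshape(n, C - 1).mean()
    return ce + sq

# Tiny usage example with made-up logits and labels.
logits = np.array([[2.0, -0.5, 0.1], [0.3, 1.7, -1.0]])
y = np.array([0, 1])
print(squentropy_like(logits, y))
```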

On Emergence of Clean-Priority Learning in Early Stopped Neural Networks

no code implementations5 Jun 2023 Chaoyue Liu, Amirhesam Abedsoltan, Mikhail Belkin

This behaviour is believed to be a result of neural networks learning the pattern of clean data first and fitting the noise later in the training, a phenomenon that we refer to as clean-priority learning.

Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

no code implementations7 Jun 2023 Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD).

Mechanism of feature learning in convolutional neural networks

1 code implementation1 Sep 2023 Daniel Beaglehole, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin

We then demonstrate the generality of our result by using the patch-based AGOP to enable deep feature learning in convolutional kernel machines.

On the Nyström Approximation for Preconditioning in Kernel Machines

no code implementations6 Dec 2023 Amirhesam Abedsoltan, Parthe Pandit, Luis Rademacher, Mikhail Belkin

Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning.
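For context, the Nyström approximation itself can be sketched in a few lines; landmark selection and the actual preconditioning analysis are the paper's subject and are not reproduced here. With $m$ landmark points, $K \approx C W^{+} C^{\top}$, where $C$ holds kernel values against the landmarks and $W$ is the landmark-landmark block. Data and kernel parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 100                               # n data points, m landmark points
X = rng.normal(size=(n, 5))

def gauss_K(A, B, h=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h ** 2))

idx = rng.choice(n, size=m, replace=False)     # random landmark subset
C = gauss_K(X, X[idx])                         # n x m
W = gauss_K(X[idx], X[idx])                    # m x m

# Nystrom approximation K ~ C W^+ C^T: its (cheap) eigendecomposition is what
# preconditioned iterative kernel solvers typically build on.
K_approx = C @ np.linalg.pinv(W) @ C.T
K_exact = gauss_K(X, X)
print("relative error of the Nystrom approximation:",
      np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact))
```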

Linear Recursive Feature Machines provably recover low-rank matrices

1 code implementation9 Jan 2024 Adityanarayanan Radhakrishnan, Mikhail Belkin, Dmitriy Drusvyatskiy

A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction - a process called feature learning.

Dimensionality Reduction Low-Rank Matrix Completion +1

Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination

1 code implementation15 Feb 2024 Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić

Our results demonstrate the usefulness of this approach across different models and sizes, and also with parameter-efficient fine-tuning, offering a novel pathway to addressing the challenges with private and sensitive data in LLM applications.

Natural Language Understanding

Average gradient outer product as a mechanism for deep neural collapse

no code implementations21 Feb 2024 Daniel Beaglehole, Peter Súkeník, Marco Mondelli, Mikhail Belkin

In this work, we provide substantial evidence that DNC formation occurs primarily through deep feature learning with the average gradient outer product (AGOP).
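The average gradient outer product itself is simple to compute; the sketch below does so for a small random two-layer network with illustrative sizes (how the AGOP drives deep neural collapse is the subject of the paper, not of this snippet): average the outer products of input gradients over the data and inspect the spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 10, 64, 200

# A small random two-layer net f(x) = w2 . tanh(W1 x); gradients are w.r.t. the input.
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
w2 = rng.normal(size=h) / np.sqrt(h)
X = rng.normal(size=(n, d))

def input_grad(x):
    return W1.T @ (w2 * (1 - np.tanh(W1 @ x) ** 2))   # d f / d x

# Average gradient outer product (AGOP) over the data.
agop = sum(np.outer(g, g) for g in map(input_grad, X)) / n

# Its spectrum indicates which input directions the function actually uses.
print(np.round(np.linalg.eigvalsh(agop)[::-1][:5], 3))
```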
