A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction, a process called feature learning.
We then demonstrate the generality of our result by using the patch-based average gradient outer product (AGOP) to enable deep feature learning in convolutional kernel machines.
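For concreteness (this illustration is background, not taken from the abstract itself): the AGOP of a predictor f over inputs x_1, ..., x_n is the matrix (1/n) \sum_i \nabla f(x_i) \nabla f(x_i)^\top, and the patch-based variant applies the same construction per image patch. A minimal numpy sketch for a scalar-valued Gaussian-kernel predictor, with synthetic data and all names chosen purely for illustration:

import numpy as np

def gaussian_kernel(X, Z, sigma):
    # K[i, j] = exp(-||X_i - Z_j||^2 / (2 sigma^2))
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def input_gradients(X_train, alpha, X_eval, sigma):
    # Gradient of f(x) = sum_j alpha_j K(x, x_j) with respect to the input x,
    # using grad_x K(x, x_j) = -(x - x_j) / sigma^2 * K(x, x_j).
    K = gaussian_kernel(X_eval, X_train, sigma)            # shape (m, n)
    diffs = X_eval[:, None, :] - X_train[None, :, :]       # shape (m, n, d)
    return -np.einsum('mn,mnd->md', K * alpha[None, :], diffs) / sigma ** 2

def agop(grads):
    # Average Gradient Outer Product: (1/m) sum_i grad f(x_i) grad f(x_i)^T.
    # Its top eigenvectors are the input directions the predictor relies on most.
    return grads.T @ grads / grads.shape[0]

rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 2.0
X = rng.normal(size=(n, d))
y = X[:, 0]                                                # target depends only on coordinate 0
alpha = np.linalg.solve(gaussian_kernel(X, X, sigma) + 1e-3 * np.eye(n), y)
M = agop(input_gradients(X, alpha, X, sigma))
print(np.round(np.diag(M), 3))                             # mass should concentrate on coordinate 0

Loosely speaking, reweighting inputs by such a matrix between successive kernel fits is how the AGOP can drive feature learning in kernel machines.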
In this paper, we first present an explanation for the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD).
In recent years, neural networks have achieved impressive results on many technological and scientific tasks.
In this work, we propose a transfer learning framework for kernel methods by projecting and translating the source model to the target task.
While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models.
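For reference (standard background, not a claim from this abstract), the linear model in question is the first-order Taylor expansion of the network f(x; w) in its parameters around initialization w_0,

f_{\mathrm{lin}}(x; w) = f(x; w_0) + \nabla_w f(x; w_0)^{\top} (w - w_0),

which is linear in the parameters w, though generally nonlinear in the input x.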
In this work, we identify and construct an explicit set of neural network classifiers that achieve optimality.
Establishing a fast rate of convergence for optimization methods is crucial to their applicability in practice.
Remarkably, taking the width of a neural network to infinity allows for improved computational performance.
Aligned latent spaces, where meaningful semantic shifts in the input space correspond to a translation in the embedding space, play an important role in the success of downstream tasks such as unsupervised clustering and data imputation.
While deep networks have produced state-of-the-art results in several domains from image classification to machine translation, hyper-parameter selection remains a significant computational bottleneck.
We then present a novel linear regression framework for characterizing the impact of depth on test risk, and show that increasing depth leads to a U-shaped test risk for the linear convolutional neural tangent kernel (CNTK).
The following questions are fundamental to understanding the properties of over-parameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum?
Recent work provided an explanation for this phenomenon by introducing the double descent curve, showing that increasing model capacity past the interpolation threshold leads to a decrease in test error.
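As a hedged illustration of that curve (a standard random-features experiment, not the cited work's setup), minimum-norm regression with random ReLU features typically shows test error peaking near the interpolation threshold, where the number of features roughly equals the number of samples, and then decreasing again as width grows:

import numpy as np

rng = np.random.default_rng(0)
n, d, noise = 50, 10, 0.5
w_true = rng.normal(size=d)
X, X_test = rng.normal(size=(n, d)), rng.normal(size=(2000, d))
y = X @ w_true + noise * rng.normal(size=n)
y_test = X_test @ w_true

for width in [10, 25, 45, 50, 55, 100, 400, 1600]:
    W = rng.normal(size=(d, width)) / np.sqrt(d)
    relu_feats = lambda A: np.maximum(A @ W, 0.0)   # random ReLU features
    beta = np.linalg.pinv(relu_feats(X)) @ y        # minimum-norm least squares fit
    mse = np.mean((relu_feats(X_test) @ beta - y_test) ** 2)
    print(f"width={width:5d}  test MSE={mse:.3f}")

The exact shape depends on the noise level and seed; the point is only that error past the interpolation threshold (width ≈ n) can fall below its peak, as the double descent curve describes.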
GMD subsumes popular first-order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad.
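For context, these are standard special cases of the mirror descent update (the GMD formulation with time-dependent mirrors generalizes this; what follows is textbook background, not the paper's exact definition):

\nabla \psi(w_{t+1}) = \nabla \psi(w_t) - \eta \, \nabla L(w_t),

where \psi is the mirror potential. Taking \psi(w) = \tfrac{1}{2}\lVert w \rVert^2 recovers gradient descent, while a quadratic potential \psi_t(w) = \tfrac{1}{2} w^\top D_t w with a (possibly time-varying) positive-definite preconditioner D_t gives the preconditioned update w_{t+1} = w_t - \eta D_t^{-1} \nabla L(w_t), of which Adagrad's diagonal preconditioning is an instance.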
We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1-dimensional outputs as defined by Ji and Telgarsky (2018).
Identifying computational mechanisms for memorization and retrieval of data is a long-standing problem at the intersection of machine learning and neuroscience.
In this paper, we link memorization of images in deep convolutional autoencoders to downsampling through strided convolution.
The ability of deep neural networks to generalize well in the overparameterized regime has become a subject of significant research interest.
Understanding how a complex machine learning model makes a classification decision is essential for its acceptance in sensitive areas such as health care.