Search Results for author: Manfred K. Warmuth

Found 40 papers, 4 papers with code

Noise misleads rotation invariant algorithms on sparse targets

no code implementations • 5 Mar 2024 • Manfred K. Warmuth, Wojciech Kotłowski, Matt Jones, Ehsan Amid

It is well known that the class of rotation invariant algorithms is suboptimal even for learning sparse linear problems when the number of examples is below the "dimension" of the problem.

Tempered Calculus for ML: Application to Hyperbolic Model Embedding

no code implementations • 6 Feb 2024 • Richard Nock, Ehsan Amid, Frank Nielsen, Alexander Soen, Manfred K. Warmuth

Most mathematical distortions used in ML are fundamentally integral in nature: $f$-divergences, Bregman divergences, (regularized) optimal transport distances, integral probability metrics, geodesic distances, etc.

The Tempered Hilbert Simplex Distance and Its Application To Non-linear Embeddings of TEMs

no code implementations • 22 Nov 2023 • Ehsan Amid, Frank Nielsen, Richard Nock, Manfred K. Warmuth

Tempered Exponential Measures (TEMs) are a parametric generalization of the exponential family of distributions maximizing the tempered entropy function among positive measures subject to a probability normalization of their power densities.

Optimal Transport with Tempered Exponential Measures

no code implementations • 7 Sep 2023 • Ehsan Amid, Frank Nielsen, Richard Nock, Manfred K. Warmuth

In the field of optimal transport, two prominent subfields face each other: (i) unregularized optimal transport, "à-la-Kantorovich", which leads to extremely sparse plans but with algorithms that scale poorly, and (ii) entropic-regularized optimal transport, "à-la-Sinkhorn-Cuturi", which admits near-linear-time approximation algorithms but leads to maximally dense plans.
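The entropic-regularized side of this contrast is the standard Sinkhorn iteration, which can be sketched in a few lines of pure Python (the histograms and cost matrix below are illustrative assumptions, not taken from the paper):

```python
import math

def sinkhorn(a, b, C, eps=0.1, n_iter=200):
    """Entropic-regularized optimal transport between histograms a and b
    with cost matrix C. Returns a plan P whose row sums approximate a
    and whose column sums match b."""
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-C / eps)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # alternating scaling: u <- a / (K v), then v <- b / (K^T u)
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

a = [0.5, 0.5]
b = [0.25, 0.75]
C = [[0.0, 1.0], [1.0, 0.0]]
P = sinkhorn(a, b, C)
```

Note how every entry of the returned plan is strictly positive: this is the "maximally un-sparse" behavior of entropic regularization that the paper contrasts with the sparse Kantorovich plans.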

A Mechanism for Sample-Efficient In-Context Learning for Sparse Retrieval Tasks

no code implementations • 26 May 2023 • Jacob Abernethy, Alekh Agarwal, Teodor V. Marinov, Manfred K. Warmuth

We study the phenomenon of in-context learning (ICL) exhibited by large language models, where they can adapt to a new learning task, given a handful of labeled examples, without any explicit parameter optimization.

In-Context Learning • Retrieval

Learning from Randomly Initialized Neural Network Features

no code implementations • 13 Feb 2022 • Ehsan Amid, Rohan Anil, Wojciech Kotłowski, Manfred K. Warmuth

We present the surprising result that randomly initialized neural networks are good feature extractors in expectation.

Step-size Adaptation Using Exponentiated Gradient Updates

no code implementations • 31 Jan 2022 • Ehsan Amid, Rohan Anil, Christopher Fifty, Manfred K. Warmuth

In this paper, we update the step-size scale and the gain variables with exponentiated gradient updates instead.
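The flavor of a multiplicative step-size update can be illustrated with a hypergradient-style signal (the inner product of successive gradients). This toy sketch is my own illustration of why an exponentiated update is attractive here (the step size stays positive by construction); it is not the paper's algorithm:

```python
import math

def eg_step_size(eta, beta, g_prev, g_cur):
    """Exponentiated-gradient-style step-size update: grow eta when
    successive gradients align, shrink it when they disagree.
    eta remains positive no matter what the signal is."""
    align = sum(p * c for p, c in zip(g_prev, g_cur))
    return eta * math.exp(beta * align)

eta = 0.1
# gradients pointing the same way -> step size grows
eta_up = eg_step_size(eta, 0.5, [1.0, 0.0], [1.0, 0.0])
# gradients flipping sign -> step size shrinks, but never goes negative
eta_down = eg_step_size(eta, 0.5, [1.0, 0.0], [-1.0, 0.0])
```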

LocoProp: Enhancing BackProp via Local Loss Optimization

1 code implementation • 11 Jun 2021 • Ehsan Amid, Rohan Anil, Manfred K. Warmuth

Second-order methods have shown state-of-the-art performance for optimizing deep neural networks.

Second-order methods

Exponentiated Gradient Reweighting for Robust Training Under Label Noise and Beyond

no code implementations • 3 Apr 2021 • Negin Majidi, Ehsan Amid, Hossein Talebi, Manfred K. Warmuth

Many learning tasks in machine learning can be viewed as taking a gradient step towards minimizing the average loss of a batch of examples in each training iteration.
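The exponentiated-gradient reweighting idea can be sketched in a few lines: maintain a distribution over the examples in the batch, multiplicatively shrink the weight of examples with large loss, and renormalize. This is an illustrative sketch with a made-up learning rate, not the paper's exact procedure:

```python
import math

def eg_reweight(weights, losses, eta=1.0):
    """One exponentiated-gradient step on per-example weights:
    multiply each weight by exp(-eta * loss), then renormalize to sum to 1.
    Examples with persistently high loss (e.g. mislabeled ones under
    label noise) are downweighted over time."""
    w = [wi * math.exp(-eta * li) for wi, li in zip(weights, losses)]
    z = sum(w)
    return [wi / z for wi in w]

# three examples, the last with a suspiciously large loss (possible label noise)
weights = [1 / 3, 1 / 3, 1 / 3]
weights = eg_reweight(weights, [0.1, 0.1, 2.0])
```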

A case where a spindly two-layer linear network whips any neural network with a fully connected input layer

no code implementations • 16 Oct 2020 • Manfred K. Warmuth, Wojciech Kotłowski, Ehsan Amid

It was conjectured that no neural network of any structure, with arbitrary differentiable transfer functions at the nodes, can learn the following problem sample-efficiently when trained with gradient descent: the instances are the rows of a $d$-dimensional Hadamard matrix and the target is one of the features, i.e., very sparse.
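The learning problem behind this conjecture is concrete enough to write down: a Sylvester-construction Hadamard matrix supplies the instances, and a single column supplies the 1-sparse linear target. A small sketch:

```python
def hadamard(d):
    """Sylvester construction of a d x d Hadamard matrix (d a power of two).
    Rows have +/-1 entries and are mutually orthogonal."""
    H = [[1]]
    while len(H) < d:
        # H_{2n} = [[H, H], [H, -H]]
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

d = 8
H = hadamard(d)
X = H                       # instances: the d rows of the Hadamard matrix
y = [row[3] for row in H]   # target: feature 3, i.e. a 1-sparse linear target
```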

Reparameterizing Mirror Descent as Gradient Descent

no code implementations • NeurIPS 2020 • Ehsan Amid, Manfred K. Warmuth

We present a general framework for casting a mirror descent update as a gradient descent update on a different set of parameters.
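A classic instance of this correspondence: unnormalized exponentiated gradient on $w$ matches, in the small-step limit, plain gradient descent on $u$ under the reparameterization $w = u^2/4$ (the squared Jacobian $(u/2)^2$ then equals $w$, the mirror-descent metric). A numeric sketch of the first-order agreement (my own illustration, not code from the paper):

```python
import math

def eg_step(w, g, eta):
    # unnormalized exponentiated-gradient (mirror descent) update on w
    return w * math.exp(-eta * g)

def gd_reparam_step(w, g, eta):
    # gradient descent on u where w = u**2 / 4;
    # chain rule gives dL/du = (u/2) * dL/dw
    u = 2.0 * math.sqrt(w)
    u -= eta * (u / 2.0) * g
    return u * u / 4.0

w, g, eta = 0.5, 1.0, 1e-3
w_eg = eg_step(w, g, eta)          # mirror descent update
w_gd = gd_reparam_step(w, g, eta)  # gradient descent on reparameterized u
```

For a small step size the two updates agree to first order; the discrepancy is $O(\eta^2)$.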

TriMap: Large-scale Dimensionality Reduction Using Triplets

1 code implementation • 1 Oct 2019 • Ehsan Amid, Manfred K. Warmuth

We empirically show the excellent performance of TriMap on a large variety of datasets in terms of the quality of the embedding as well as the runtime.

Dimensionality Reduction

An Implicit Form of Krasulina's k-PCA Update without the Orthonormality Constraint

no code implementations • 11 Sep 2019 • Ehsan Amid, Manfred K. Warmuth

We show that Krasulina's update corresponds to a projected gradient descent step on the Stiefel manifold of the orthonormal $k$-frames, while Oja's update amounts to a gradient descent step using the unprojected gradient.
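The geometric distinction in this abstract is easy to verify numerically for $k=1$: Krasulina's update direction is the gradient with its component along $w$ projected out, so it is orthogonal to the current iterate, while Oja's direction is the raw, unprojected gradient. A pure-Python check:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def oja_direction(w, x):
    # Oja: unprojected gradient direction  x (x^T w)
    s = dot(x, w)
    return [s * xi for xi in x]

def krasulina_direction(w, x):
    # Krasulina: project out the component along w:
    #   x (x^T w) - ((x^T w)^2 / ||w||^2) w
    s = dot(x, w)
    c = s * s / dot(w, w)
    return [s * xi - c * wi for xi, wi in zip(x, w)]

w = [1.0, 2.0, 0.5]
x = [0.3, -1.0, 2.0]
```

By construction, `dot(w, krasulina_direction(w, x))` vanishes (the update moves tangentially to the current iterate), whereas Oja's direction generally has a nonzero component along `w`.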

Unbiased estimators for random design regression

no code implementations • 8 Jul 2019 • Michał Dereziński, Manfred K. Warmuth, Daniel Hsu

We use them to show that for any input distribution and $\epsilon>0$ there is a random design consisting of $O(d\log d+ d/\epsilon)$ points from which an unbiased estimator can be constructed whose expected square loss over the entire distribution is bounded by $1+\epsilon$ times the loss of the optimum.


Robust Bi-Tempered Logistic Loss Based on Bregman Divergences

6 code implementations • NeurIPS 2019 • Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren

We introduce a temperature into the exponential function and replace the softmax output layer of neural nets by a high-temperature generalization.
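The temperature-generalized exponential underlying this loss is $\exp_t(x) = [1+(1-t)x]_+^{1/(1-t)}$, with $\log_t$ its inverse; both reduce to the ordinary exp/log as $t \to 1$. A minimal sketch of the two functions (the full bi-tempered loss also needs a normalized tempered softmax, omitted here):

```python
import math

def exp_t(x, t):
    """Tempered exponential: [1 + (1 - t) * x]_+ ** (1 / (1 - t)).
    Reduces to exp(x) at t = 1; has bounded support below for t < 1
    and a heavy tail for t > 1."""
    if t == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    return max(base, 0.0) ** (1.0 / (1.0 - t))

def log_t(x, t):
    """Tempered logarithm, the inverse of exp_t on its positive range."""
    if t == 1.0:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)
```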

Adaptive scale-invariant online algorithms for learning linear models

no code implementations • 20 Feb 2019 • Michał Kempka, Wojciech Kotłowski, Manfred K. Warmuth

We consider online learning with linear models, where the algorithm predicts on sequentially revealed instances (feature vectors), and is compared against the best linear function (comparator) in hindsight.

Divergence-Based Motivation for Online EM and Combining Hidden Variable Models

no code implementations • 11 Feb 2019 • Ehsan Amid, Manfred K. Warmuth

Expectation-Maximization (EM) is a prominent approach for parameter estimation of hidden (aka latent) variable models.

Reverse iterative volume sampling for linear regression

no code implementations • 6 Jun 2018 • Michał Dereziński, Manfred K. Warmuth

We can only afford to attain the responses for a small subset of the points that are then used to construct linear predictions for all points in the dataset.

BIG-bench Machine Learning • regression

Online Non-Additive Path Learning under Full and Partial Information

no code implementations • 18 Apr 2018 • Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, Holakou Rahmanian, Manfred K. Warmuth

We study the problem of online path learning with non-additive gains, which is a central problem appearing in several applications, including ensemble structured prediction.

Structured Prediction

Speech Recognition: Keyword Spotting Through Image Recognition

no code implementations • 10 Mar 2018 • Sanjay Krishna Gouda, Salil Kanetkar, David Harrison, Manfred K. Warmuth

The problem of identifying voice commands has always been a challenge due to the presence of noise and variability in speed, pitch, etc.

Image Classification • Keyword Spotting +2

A more globally accurate dimensionality reduction method using triplets

1 code implementation • 1 Mar 2018 • Ehsan Amid, Manfred K. Warmuth

We first show that the commonly used dimensionality reduction (DR) methods such as t-SNE and LargeVis poorly capture the global structure of the data in the low dimensional embedding.

Dimensionality Reduction

Leveraged volume sampling for linear regression

no code implementations • NeurIPS 2018 • Michał Dereziński, Manfred K. Warmuth, Daniel Hsu

We then develop a new rescaled variant of volume sampling that produces an unbiased estimate which avoids this bad behavior and has at least as good a tail bound as leverage score sampling: sample size $k=O(d\log d + d/\epsilon)$ suffices to guarantee total loss at most $1+\epsilon$ times the minimum with high probability.

Point Processes • regression

Subsampling for Ridge Regression via Regularized Volume Sampling

no code implementations • 14 Oct 2017 • Michał Dereziński, Manfred K. Warmuth

However, when labels are expensive, we are forced to select only a small subset of vectors $\mathbf{x}_i$ for which we obtain the labels $y_i$.


Online Dynamic Programming

no code implementations • NeurIPS 2017 • Holakou Rahmanian, Manfred K. Warmuth

We consider the problem of repeatedly solving a variant of the same dynamic programming problem in successive trials.

Two-temperature logistic regression based on the Tsallis divergence

no code implementations • 19 May 2017 • Ehsan Amid, Manfred K. Warmuth, Sriram Srinivasan

We explain this by showing that $t_1 < 1$ caps the surrogate loss and $t_2 > 1$ makes the predictive distribution have a heavy tail.

regression • Vocal Bursts Valence Prediction

Unbiased estimates for linear regression via volume sampling

no code implementations • NeurIPS 2017 • Michał Dereziński, Manfred K. Warmuth

The pseudoinverse plays an important part in solving the linear least squares problem, where we try to predict a label for each column of $X$.
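The headline unbiasedness result can be checked by brute force on a tiny instance: draw a size-$d$ subset $S$ of points with probability proportional to $\det(X_S)^2$, solve exactly on $S$, and the expectation over subsets recovers the full least-squares solution. A sketch with data points as rows (the paper's $X$ has points as columns) and made-up numbers:

```python
from itertools import combinations

# tiny 3 x 2 design matrix (rows are points) and labels
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 4.0]

def solve2(A, b):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    w0 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    w1 = (A[0][0] * b[1] - b[0] * A[1][0]) / det
    return [w0, w1]

# full least-squares solution w* = (X^T X)^{-1} X^T y
XtX = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(2)]
w_star = solve2(XtX, Xty)

# volume sampling over all size-2 row subsets: P(S) proportional to det(X_S)^2
dets2, w_S = [], []
for S in combinations(range(3), 2):
    A = [X[i] for i in S]
    b = [y[i] for i in S]
    d = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    dets2.append(d * d)
    w_S.append(solve2(A, b))
Z = sum(dets2)
# expectation of the subset solutions under volume sampling
w_avg = [sum(p / Z * w[i] for p, w in zip(dets2, w_S)) for i in range(2)]
```

On this instance the volume-sampled expectation `w_avg` matches `w_star` exactly, as the unbiasedness theorem predicts.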


Low-dimensional Data Embedding via Robust Ranking

no code implementations • 30 Nov 2016 • Ehsan Amid, Nikos Vlassis, Manfred K. Warmuth

We describe a new method called t-ETE for finding a low-dimensional embedding of a set of objects in Euclidean space.

PCA with Gaussian perturbations

no code implementations • 16 Jun 2015 • Wojciech Kotłowski, Manfred K. Warmuth

We develop a simple algorithm that needs $O(kn^2)$ time per trial and whose regret is off by a small factor of $O(n^{1/4})$.

Labeled compression schemes for extremal classes

no code implementations • 30 May 2015 • Shay Moran, Manfred K. Warmuth

We consider a generalization of maximum classes called extremal classes.

The limits of squared Euclidean distance regularization

no code implementations • NeurIPS 2014 • Michal Derezinski, Manfred K. Warmuth

We conjecture that our hardness results hold for any training algorithm that is based on the squared Euclidean distance regularization (i.e., back-propagation with the weight-decay heuristic).

A Bayesian Probability Calculus for Density Matrices

no code implementations • 9 Aug 2014 • Manfred K. Warmuth, Dima Kuzmin

Finite probability distributions are a special case where the density matrix is restricted to be diagonal.

On-line PCA with Optimal Regrets

no code implementations • 17 Jun 2013 • Jiazhong Nie, Wojciech Kotlowski, Manfred K. Warmuth

Furthermore, we show that when considering regret bounds as function of a loss budget, EG remains optimal and strictly outperforms GD.

Putting Bayes to sleep

no code implementations • NeurIPS 2012 • Dmitry Adamskiy, Manfred K. Warmuth, Wouter M. Koolen

When the nature of the data changes over time, so that different models predict well on different segments, adaptivity is typically achieved by mixing a bit of the initial prior into the weights in each round (a kind of weak restart).

Learning Eigenvectors for Free

no code implementations • NeurIPS 2011 • Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth

In this extension, the alphabet of $n$ outcomes is replaced by the set of all dyads, i.e., outer products $\mathbf{u}\mathbf{u}^\top$ where $\mathbf{u}$ is a unit-length vector in $\mathbb{R}^n$.
