Search Results for author: Ohad Shamir

Found 77 papers, 6 papers with code

Convergence Results For Q-Learning With Experience Replay

no code implementations 8 Dec 2021 Liran Szlak, Ohad Shamir

A commonly used heuristic in RL is experience replay (e.g., Lin, 1993; Mnih et al., 2015), in which a learner stores and re-uses past trajectories as if they were sampled online.

Q-Learning

Replay For Safety

no code implementations 8 Dec 2021 Liran Szlak, Ohad Shamir

Experience replay (Lin, 1993; Mnih et al., 2015) is a widely used technique to achieve efficient use of data and improved performance in RL algorithms.

Q-Learning

On the Optimal Memorization Power of ReLU Neural Networks

no code implementations 7 Oct 2021 Gal Vardi, Gilad Yehudai, Ohad Shamir

We prove that having such a large bit complexity is both necessary and sufficient for memorization with a sub-linear number of parameters.

A Stochastic Newton Algorithm for Distributed Convex Optimization

no code implementations NeurIPS 2021 Brian Bullins, Kumar Kshitij Patel, Ohad Shamir, Nathan Srebro, Blake Woodworth

We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations performed between rounds of communication.

On Margin Maximization in Linear and ReLU Networks

no code implementations 6 Oct 2021 Gal Vardi, Ohad Shamir, Nathan Srebro

Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in the parameter space.

Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned Problems

1 code implementation NeurIPS 2021 Itay Safran, Ohad Shamir

Perhaps surprisingly, we prove that when the condition number is taken into account, without-replacement SGD *does not* significantly improve on with-replacement SGD in terms of worst-case bounds, unless the number of epochs (passes over the data) is larger than the condition number.

Learning a Single Neuron with Bias Using Gradient Descent

no code implementations NeurIPS 2021 Gal Vardi, Gilad Yehudai, Ohad Shamir

We theoretically study the fundamental problem of learning a single neuron with a bias term ($\mathbf{x} \mapsto \sigma(\langle\mathbf{w},\mathbf{x}\rangle + b)$) in the realizable setting with the ReLU activation, using gradient descent.

Oracle Complexity in Nonsmooth Nonconvex Optimization

no code implementations NeurIPS 2021 Guy Kornowski, Ohad Shamir

For this approach, we prove under a mild assumption an inherent trade-off between oracle complexity and smoothness: on the one hand, smoothing a nonsmooth nonconvex function can be done very efficiently (e.g., by randomized smoothing), but with dimension-dependent factors in the smoothness parameter, which can strongly affect iteration complexity when plugging into standard smooth optimization methods.

The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication

no code implementations 2 Feb 2021 Blake Woodworth, Brian Bullins, Ohad Shamir, Nathan Srebro

We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates.

The Connection Between Approximation, Depth Separation and Learnability in Neural Networks

no code implementations 31 Jan 2021 Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks.

Size and Depth Separation in Approximating Benign Functions with Neural Networks

no code implementations 30 Jan 2021 Gal Vardi, Daniel Reichman, Toniann Pitassi, Ohad Shamir

We show a complexity-theoretic barrier to proving such results beyond size $O(d\log^2(d))$, but also show an explicit benign function that can be approximated with networks of size $O(d)$ and not with networks of size $o(d/\log d)$.

Implicit Regularization in ReLU Networks with the Square Loss

no code implementations 9 Dec 2020 Gal Vardi, Ohad Shamir

For one-hidden-layer networks, we prove a similar result, where in general it is impossible to characterize implicit regularization properties in this manner, except for the "balancedness" property identified in Du et al. [2018].

High-Order Oracle Complexity of Smooth and Strongly Convex Optimization

no code implementations 13 Oct 2020 Guy Kornowski, Ohad Shamir

In this note, we consider the complexity of optimizing a highly smooth (Lipschitz $k$-th order derivative) and strongly convex function, via calls to a $k$-th order oracle which returns the value and first $k$ derivatives of the function at a given point, and where the dimension is unrestricted.

Gradient Methods Never Overfit On Separable Data

no code implementations 30 Jun 2020 Ohad Shamir

A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor.

The Effects of Mild Over-parameterization on the Optimization Landscape of Shallow ReLU Neural Networks

1 code implementation 1 Jun 2020 Itay Safran, Gilad Yehudai, Ohad Shamir

We prove that while the objective is strongly convex around the global minima when the teacher and student networks possess the same number of neurons, it is not even *locally convex* after any amount of over-parameterization.

Neural Networks with Small Weights and Depth-Separation Barriers

no code implementations NeurIPS 2020 Gal Vardi, Ohad Shamir

To show this, we study a seemingly unrelated problem of independent interest: namely, whether there are polynomially-bounded functions which require super-polynomial weights in order to approximate with constant-depth neural networks.

Can We Find Near-Approximately-Stationary Points of Nonsmooth Nonconvex Functions?

no code implementations 27 Feb 2020 Ohad Shamir

It is well-known that given a bounded, smooth nonconvex function, standard gradient-based methods can find $\epsilon$-stationary points (where the gradient norm is less than $\epsilon$) in $\mathcal{O}(1/\epsilon^2)$ iterations.

Is Local SGD Better than Minibatch SGD?

no code implementations ICML 2020 Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro

We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method.

Distributed Optimization

Proving the Lottery Ticket Hypothesis: Pruning is All You Need

no code implementations ICML 2020 Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

The lottery ticket hypothesis (Frankle and Carbin, 2018) states that a randomly-initialized network contains a small subnetwork that, when trained in isolation, can compete with the performance of the original network.

Learning a Single Neuron with Gradient Methods

no code implementations 15 Jan 2020 Gilad Yehudai, Ohad Shamir

We consider the fundamental problem of learning a single neuron $x \mapsto\sigma(w^\top x)$ using standard gradient methods.

The Complexity of Finding Stationary Points with Stochastic Gradient Descent

no code implementations ICML 2020 Yoel Drori, Ohad Shamir

We study the iteration complexity of stochastic gradient descent (SGD) for minimizing the gradient norm of smooth, possibly nonconvex functions.

How Good is SGD with Random Shuffling?

no code implementations 31 Jul 2019 Itay Safran, Ohad Shamir

In contrast to the majority of existing theoretical works, which assume that individual functions are sampled with replacement, we focus here on popular but poorly-understood heuristics, which involve going over random permutations of the individual functions.

Depth Separations in Neural Networks: What is Actually Being Separated?

no code implementations 15 Apr 2019 Itay Safran, Ronen Eldan, Ohad Shamir

Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^d$, which can be easily approximated with depth $3$ networks, cannot be approximated by depth $2$ networks, even up to constant accuracy, unless their size is exponential in $d$.

On the Power and Limitations of Random Features for Understanding Neural Networks

no code implementations NeurIPS 2019 Gilad Yehudai, Ohad Shamir

Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error).

The Complexity of Making the Gradient Small in Stochastic Convex Optimization

no code implementations 13 Feb 2019 Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, Blake Woodworth

Notably, we show that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model.

Stochastic Optimization

Space lower bounds for linear prediction in the streaming model

no code implementations 9 Feb 2019 Yuval Dagan, Gil Kur, Ohad Shamir

We show that fundamental learning tasks, such as finding an approximate linear separator or linear regression, require memory at least *quadratic* in the dimension, in a natural streaming setting.

Global Non-convex Optimization with Discretized Diffusions

no code implementations NeurIPS 2018 Murat A. Erdogdu, Lester Mackey, Ohad Shamir

An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems.

Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks

no code implementations 23 Sep 2018 Ohad Shamir

We study the dynamics of gradient descent on objective functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar parameters $w_1,\ldots, w_k$), which arise in the context of training depth-$k$ linear neural networks.

A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

no code implementations 26 Jun 2018 Yossi Arjevani, Ohad Shamir, Nathan Srebro

We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $\tau$ rounds ago.

Distributed Optimization

Are ResNets Provably Better than Linear Predictors?

no code implementations NeurIPS 2018 Ohad Shamir

In this paper, we rigorously prove that arbitrarily deep, nonlinear residual units indeed exhibit this behavior, in the sense that the optimization landscape contains no local minima with value above what can be obtained with a linear predictor (namely a 1-layer network).

Detecting Correlations with Little Memory and Communication

no code implementations 4 Mar 2018 Yuval Dagan, Ohad Shamir

We study the problem of identifying correlations in multivariate data, under information constraints: either on the amount of memory that can be used by the algorithm, or the amount of communication when the data is distributed across several machines.

Spurious Local Minima are Common in Two-Layer ReLU Neural Networks

1 code implementation ICML 2018 Itay Safran, Ohad Shamir

We consider the optimization problem associated with training simple ReLU neural networks of the form $\mathbf{x}\mapsto \sum_{i=1}^{k}\max\{0,\mathbf{w}_i^\top \mathbf{x}\}$ with respect to the squared loss.

Size-Independent Sample Complexity of Neural Networks

no code implementations 18 Dec 2017 Noah Golowich, Alexander Rakhlin, Ohad Shamir

We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer.

Weight Sharing is Crucial to Succesful Optimization

no code implementations 2 Jun 2017 Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah

Exploiting the great expressive power of Deep Neural Network architectures relies on the ability to train them.

Bandit Regret Scaling with the Effective Loss Range

no code implementations 15 May 2017 Nicolò Cesa-Bianchi, Ohad Shamir

We study how the regret guarantees of nonstochastic multi-armed bandits can be improved if the effective range of the losses in each round is small (e.g., the maximal difference between two losses in a given round).

Multi-Armed Bandits

Failures of Gradient-Based Deep Learning

1 code implementation ICML 2017 Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah

In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art.

Online Learning with Local Permutations and Delayed Feedback

no code implementations ICML 2017 Ohad Shamir, Liran Szlak

In this paper, we consider the applicability of this setting to convex online learning with delayed feedback, in which the feedback on the prediction made in round $t$ arrives with some delay $\tau$.

Communication-efficient Algorithms for Distributed Stochastic Principal Component Analysis

no code implementations ICML 2017 Dan Garber, Ohad Shamir, Nathan Srebro

We study algorithms for estimating the leading principal component of the population covariance matrix that are both communication-efficient and achieve estimation error of the order of the centralized ERM solution that uses all $mn$ samples.

Without-Replacement Sampling for Stochastic Gradient Methods

no code implementations NeurIPS 2016 Ohad Shamir

Stochastic gradient methods for machine learning and optimization problems are usually analyzed assuming data points are sampled *with* replacement.

Distributed Optimization Learning Theory

Oracle Complexity of Second-Order Methods for Finite-Sum Problems

no code implementations ICML 2017 Yossi Arjevani, Ohad Shamir

Finite-sum optimization problems are ubiquitous in machine learning, and are commonly solved using first-order methods which rely on gradient computations.

Depth-Width Tradeoffs in Approximating Natural Functions with Neural Networks

no code implementations ICML 2017 Itay Safran, Ohad Shamir

We provide several new depth-based separation results for feed-forward neural networks, proving that various types of simple and natural functions can be better approximated using deeper networks than shallower ones, even if the shallower networks are much larger.

Distribution-Specific Hardness of Learning Neural Networks

no code implementations 5 Sep 2016 Ohad Shamir

Although neural networks are routinely and successfully trained in practice using simple gradient-based methods, most existing theoretical results are negative, showing that learning such networks is difficult, in a worst-case sense over all data distributions.

Dimension-Free Iteration Complexity of Finite Sum Optimization Problems

no code implementations NeurIPS 2016 Yossi Arjevani, Ohad Shamir

Many canonical machine learning problems boil down to a convex optimization problem with a finite sum structure.

On the Iteration Complexity of Oblivious First-Order Optimization Algorithms

no code implementations 11 May 2016 Yossi Arjevani, Ohad Shamir

We consider a broad class of first-order optimization algorithms which are *oblivious*, in the sense that their step sizes are scheduled regardless of the function under consideration, except for limited side-information such as smoothness or strong convexity parameters.

Without-Replacement Sampling for Stochastic Gradient Methods: Convergence Results and Application to Distributed Optimization

no code implementations NeurIPS 2016 Ohad Shamir

Stochastic gradient methods for machine learning and optimization problems are usually analyzed assuming data points are sampled *with* replacement.

Distributed Optimization Learning Theory

The Power of Depth for Feedforward Neural Networks

no code implementations 12 Dec 2015 Ronen Eldan, Ohad Shamir

We show that there is a simple (approximately radial) function on $\mathbb{R}^d$, expressible by a small 3-layer feedforward neural network, which cannot be approximated by any 2-layer network to more than a certain constant accuracy, unless its width is exponential in the dimension.

Multi-Player Bandits -- a Musical Chairs Approach

no code implementations 9 Dec 2015 Jonathan Rosenski, Ohad Shamir, Liran Szlak

We consider a variant of the stochastic multi-armed bandit problem, where multiple players simultaneously choose from the same set of arms and may collide, receiving no reward.

On the Quality of the Initial Basin in Overspecified Neural Networks

no code implementations 13 Nov 2015 Itay Safran, Ohad Shamir

Deep learning, in the form of artificial neural networks, has achieved remarkable practical success in recent years, for a variety of difficult machine learning applications.

Convergence of Stochastic Gradient Descent for PCA

no code implementations 30 Sep 2015 Ohad Shamir

We consider the problem of principal component analysis (PCA) in a streaming stochastic setting, where our goal is to find a direction of approximate maximal variance, based on a stream of i.i.d.

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

no code implementations 31 Jul 2015 Ohad Shamir

We consider the closely related problems of bandit convex optimization with two-point feedback, and zero-order stochastic convex optimization with two function evaluations per round.

Fast Stochastic Algorithms for SVD and PCA: Convergence Properties and Convexity

no code implementations 31 Jul 2015 Ohad Shamir

We study the convergence properties of the VR-PCA algorithm introduced by Shamir (2015) for fast computation of leading singular vectors.

Communication Complexity of Distributed Convex Learning and Optimization

no code implementations NeurIPS 2015 Yossi Arjevani, Ohad Shamir

We study the fundamental limits to communication-efficient distributed methods for convex learning and optimization, under different assumptions on the information available to individual machines, and the types of functions considered.

On Lower and Upper Bounds for Smooth and Strongly Convex Optimization Problems

no code implementations 23 Mar 2015 Yossi Arjevani, Shai Shalev-Shwartz, Ohad Shamir

This, in turn, reveals a powerful connection between a class of optimization algorithms and the analytic theory of polynomials whereby new lower and upper bounds are derived.

On the Complexity of Learning with Kernels

no code implementations 5 Nov 2014 Nicolò Cesa-Bianchi, Yishay Mansour, Ohad Shamir

In this paper, we study lower bounds on the error attainable by such methods as a function of the number of entries observed in the kernel matrix or the rank of an approximate kernel matrix.

Attribute Efficient Linear Regression with Data-Dependent Sampling

no code implementations 23 Oct 2014 Doron Kukliansky, Ohad Shamir

In this paper we analyze a budgeted learning setting, in which the learner can only choose and observe a small subset of the attributes of each training example.

Nonstochastic Multi-Armed Bandits with Graph-Structured Feedback

no code implementations 30 Sep 2014 Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, Ohad Shamir

This naturally models several situations where the losses of different actions are related, and knowing the loss of one action provides information on the loss of other actions.

Multi-Armed Bandits

A Stochastic PCA and SVD Algorithm with an Exponential Convergence Rate

no code implementations 9 Sep 2014 Ohad Shamir

We describe and analyze a simple algorithm for principal component analysis and singular value decomposition, VR-PCA, which uses computationally cheap stochastic iterations, yet converges exponentially fast to the optimal solution.

On the Complexity of Bandit Linear Optimization

no code implementations 11 Aug 2014 Ohad Shamir

We study the attainable regret for online linear optimization problems with bandit feedback, where unlike the full-information setting, the player can only observe its own loss rather than the full loss vector.

The Sample Complexity of Learning Linear Predictors with the Squared Loss

no code implementations 19 Jun 2014 Ohad Shamir

In this short note, we provide a sample complexity lower bound for learning linear predictors with respect to the squared loss.

Graph Approximation and Clustering on a Budget

no code implementations 10 Jun 2014 Ethan Fetaya, Ohad Shamir, Shimon Ullman

We consider the problem of learning from a similarity matrix (such as spectral clustering and low-dimensional embedding), when computing pairwise similarities is costly, and only a limited number of entries can be observed.

Communication Efficient Distributed Optimization using an Approximate Newton-type Method

1 code implementation 30 Dec 2013 Ohad Shamir, Nathan Srebro, Tong Zhang

We present a novel Newton-type method for distributed optimization, which is particularly well suited for stochastic optimization and learning problems.

Distributed Optimization

Online Learning with Costly Features and Labels

no code implementations NeurIPS 2013 Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir

In particular, we show that with switching costs, the attainable rate with bandit feedback is $T^{2/3}$.

Probabilistic Label Trees for Efficient Large Scale Image Classification

no code implementations CVPR 2013 Baoyuan Liu, Fereshteh Sadeghi, Marshall Tappen, Ohad Shamir, Ce Liu

Large-scale recognition problems with thousands of classes pose a particular challenge because applying the classifier requires more computation as the number of classes grows.

General Classification Image Classification

An Algorithm for Training Polynomial Networks

no code implementations 26 Apr 2013 Roi Livni, Shai Shalev-Shwartz, Ohad Shamir

The main goal of this paper is the derivation of an efficient layer-by-layer algorithm for training such networks, which we denote as the *Basis Learner*.

Online Learning with Switching Costs and Other Adaptive Adversaries

no code implementations NeurIPS 2013 Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir

In particular, we show that with switching costs, the attainable rate with bandit feedback is $\widetilde{\Theta}(T^{2/3})$.

On the Complexity of Bandit and Derivative-Free Stochastic Convex Optimization

no code implementations 11 Sep 2012 Ohad Shamir

The problem of stochastic convex optimization with bandit feedback (in the learning community) or without knowledge of gradients (in the optimization community) has received much attention in recent years, in the form of algorithms and performance upper bounds.

Better Mini-Batch Algorithms via Accelerated Gradient Methods

no code implementations NeurIPS 2011 Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan

Mini-batch algorithms have recently received significant attention as a way to speed up stochastic convex optimization problems.

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

no code implementations NeurIPS 2011 Sham M. Kakade, Varun Kanade, Ohad Shamir, Adam Kalai

In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient.

From Bandits to Experts: On the Value of Side-Observations

no code implementations NeurIPS 2011 Shie Mannor, Ohad Shamir

We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game.

Multi-Armed Bandits

Efficient Online Learning via Randomized Rounding

no code implementations NeurIPS 2011 Nicolò Cesa-Bianchi, Ohad Shamir

Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader.

Collaborative Filtering

Learning with the weighted trace-norm under arbitrary sampling distributions

no code implementations NeurIPS 2011 Rina Foygel, Ohad Shamir, Nati Srebro, Ruslan R. Salakhutdinov

We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions.

Efficient Transductive Online Learning via Randomized Rounding

no code implementations 13 Jun 2011 Nicolò Cesa-Bianchi, Ohad Shamir

Most traditional online learning algorithms are based on variants of mirror descent or follow-the-leader.

Collaborative Filtering

Learning Exponential Families in High-Dimensions: Strong Convexity and Sparsity

no code implementations 31 Oct 2009 Sham M. Kakade, Ohad Shamir, Karthik Sridharan, Ambuj Tewari

The versatility of exponential families, along with their attendant convexity properties, makes them a popular and effective statistical model.

On the Reliability of Clustering Stability in the Large Sample Regime

no code implementations NeurIPS 2008 Ohad Shamir, Naftali Tishby

In this paper, we provide a set of general sufficient conditions, which ensure the reliability of clustering stability estimators in the large sample regime.

Model Selection
