no code implementations • Eran Malach, Shai Shalev-Shwartz
To show any positive theoretical results, one must make assumptions on the data distribution.
no code implementations • 3 Jul 2024 • Kaiying Hou, David Brandfonbrener, Sham Kakade, Samy Jelassi, Eran Malach
Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models.
no code implementations • 25 Jun 2024 • Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson
Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community.
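For readers unfamiliar with the method, below is a minimal single-matrix sketch of the Kronecker-factored preconditioning idea behind Shampoo: accumulate left and right gradient statistics and precondition with their inverse fourth roots. The step size, epsilon, and eigendecomposition-based root are illustrative simplifications, not the implementation studied in the paper.

```python
import numpy as np

def shampoo_step(param, grad, L, R, lr=0.1, eps=1e-4):
    """Minimal single-matrix sketch of the Shampoo preconditioner idea
    (omitting blocking, grafting, and the delayed inverse-root computations
    used in practical implementations).
    """
    L += grad @ grad.T          # left Kronecker-factor statistics
    R += grad.T @ grad          # right Kronecker-factor statistics

    def inv_fourth_root(M):
        # Symmetric inverse fourth root via eigendecomposition.
        w, V = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
        return V @ np.diag(w ** -0.25) @ V.T

    precond_grad = inv_fourth_root(L) @ grad @ inv_fourth_root(R)
    return param - lr * precond_grad, L, R

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
L, R = np.zeros((4, 4)), np.zeros((3, 3))
G = rng.normal(size=(4, 3))     # stand-in gradient for illustration
W, L, R = shampoo_step(W, G, L, R)
print(W.round(3))
```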
no code implementations • 17 Jun 2024 • Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach
Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on.
no code implementations • 16 Feb 2024 • Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, Nikolaos Tsilivis
We examine how learning is affected by varying the prior distribution over Markov chains, and consider the generalization of our in-context learning of Markov chains (ICL-MC) task to $n$-grams for $n > 2$.
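As a rough illustration of the kind of synthetic data such a task uses, the sketch below samples a fresh Markov chain per sequence, so a model must infer the chain's statistics in context. The Dirichlet prior, number of states, and sequence length are assumptions chosen for illustration rather than the paper's exact setup.

```python
import numpy as np

def sample_icl_mc_sequence(num_states=3, seq_len=128, alpha=1.0, rng=None):
    """Sample one sequence for a Markov-chain in-context learning task.

    A new transition matrix is drawn per sequence (here from a Dirichlet
    prior with concentration `alpha`), so the statistics must be inferred
    from the context rather than memorized across sequences.
    """
    rng = rng or np.random.default_rng()
    # Each row of the transition matrix is a distribution over next states.
    transition = rng.dirichlet(alpha * np.ones(num_states), size=num_states)
    seq = np.empty(seq_len, dtype=np.int64)
    seq[0] = rng.integers(num_states)
    for t in range(1, seq_len):
        seq[t] = rng.choice(num_states, p=transition[seq[t - 1]])
    return seq, transition

tokens, P = sample_icl_mc_sequence()
print(tokens[:16], P.round(2))
```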
2 code implementations • 1 Feb 2024 • Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
Empirically, we find that transformers outperform generalized state space models (GSSMs) in terms of efficiency and generalization on synthetic tasks that require copying the context.
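A synthetic copying task of this flavor can be generated in a few lines; the separator token, alphabet, and length range below are illustrative choices, not the paper's exact construction.

```python
import random
import string

def make_copy_example(min_len=5, max_len=20, vocab=string.ascii_lowercase):
    """Build one example of a synthetic copying task: the model reads a
    random string and must reproduce it verbatim after a separator token.

    Length generalization can be probed by training on short strings and
    evaluating on longer ones (the lengths here are illustrative).
    """
    n = random.randint(min_len, max_len)
    s = "".join(random.choice(vocab) for _ in range(n))
    prompt = s + "|"   # '|' acts as the copy/separator token
    target = s         # the expected continuation is an exact copy
    return prompt, target

print(make_copy_example())
```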
1 code implementation • 13 Sep 2023 • Eran Malach
Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks.
no code implementations • 7 Sep 2023 • Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang
Finally, we show that the synthetic sparse parity task can be useful as a proxy for real problems requiring axis-aligned feature learning.
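For concreteness, here is one common way to instantiate a sparse parity dataset; the dimension, sparsity level, and fixed hidden support below are illustrative parameters rather than the paper's specific configuration.

```python
import numpy as np

def sparse_parity_batch(batch_size=256, n_bits=50, k=3, seed=0):
    """Generate a batch for an (n, k) sparse parity task: inputs are uniform
    ±1 vectors of length n_bits and the label is the product (parity) of a
    fixed hidden subset of k coordinates.

    The hidden support is drawn once for illustration; recovering which k
    bits matter is exactly an axis-aligned feature-learning problem.
    """
    rng = np.random.default_rng(seed)
    support = rng.choice(n_bits, size=k, replace=False)  # hidden relevant bits
    x = rng.choice([-1.0, 1.0], size=(batch_size, n_bits))
    y = np.prod(x[:, support], axis=1)                   # parity label in {-1, +1}
    return x, y, support

x, y, support = sparse_parity_batch()
print(support, y[:8])
```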
no code implementations • 4 Sep 2023 • Etay Livne, Gal Kaplun, Eran Malach, Shai Shalev-Shwartz
However, for large datasets stored in the cloud, random access to individual examples is often costly and inefficient.
1 code implementation • 13 Feb 2023 • Gal Kaplun, Andrey Gurevich, Tal Swisa, Mazor David, Shai Shalev-Shwartz, Eran Malach
Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance.
no code implementations • 18 Jul 2022 • Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang
There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times.
no code implementations • 28 Mar 2022 • Gal Kaplun, Eran Malach, Preetum Nakkiran, Shai Shalev-Shwartz
We relate the notion of such samplers to knowledge distillation, where a student network imitates the outputs of a teacher on unlabeled data.
no code implementations • 29 Sep 2021 • Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz
Convolutional networks (CNNs) are computationally hard to learn.
no code implementations • NeurIPS 2021 • Emmanuel Abbe, Pritish Kamath, Eran Malach, Colin Sandon, Nathan Srebro
With fine enough gradient precision $\rho$ relative to the minibatch size $b$, namely when $b \rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm, and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$.
no code implementations • 1 Mar 2021 • Eran Malach, Pritish Kamath, Emmanuel Abbe, Nathan Srebro
Complementing this, we show that without these conditions, gradient descent can in fact learn with small error even when no kernel method, in particular using the tangent kernel, can achieve a non-trivial advantage over random guessing.
no code implementations • 31 Jan 2021 • Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir
On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks.
no code implementations • NeurIPS 2020 • Eran Malach, Shai Shalev-Shwartz
In fact, the proofs of such hardness results show that even weakly learning deep networks is hard.
no code implementations • ICLR 2021 • Eran Malach, Shai Shalev-Shwartz
Convolutional neural networks (CNNs) exhibit unmatched performance in a multitude of computer vision tasks.
no code implementations • 18 Aug 2020 • Eran Malach, Shai Shalev-Shwartz
A supervised learning algorithm has access to a distribution of labeled examples, and needs to return a function (hypothesis) that correctly labels the examples.
no code implementations • NeurIPS 2020 • Amit Daniely, Eran Malach
On the other hand, under the same distributions, these parities cannot be learned efficiently by linear methods.
no code implementations • ICML 2020 • Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir
The lottery ticket hypothesis (Frankle and Carbin, 2018) states that a randomly initialized network contains a small subnetwork that, when trained in isolation, can compete with the performance of the original network.
no code implementations • 25 Oct 2019 • Eran Malach, Shai Shalev-Shwartz
To separate hard-to-learn from easy-to-learn distributions, we observe the property of local correlation: the correlation between local patterns of the input and the target label.
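One simple way to probe such a property empirically is to correlate local patch statistics with the label across the dataset; the sketch below is a rough illustration under that reading, not the paper's formal definition of local correlation.

```python
import numpy as np

def local_correlation_map(images, labels, patch=3):
    """Rough illustration: for every spatial location, correlate the mean
    intensity of the local patch with the ±1 label across the dataset.
    """
    n, h, w = images.shape
    labels = labels.astype(np.float64)
    corr = np.zeros((h - patch + 1, w - patch + 1))
    for i in range(h - patch + 1):
        for j in range(w - patch + 1):
            feat = images[:, i:i + patch, j:j + patch].mean(axis=(1, 2))
            # Pearson correlation between the patch statistic and the label.
            corr[i, j] = np.corrcoef(feat, labels)[0, 1]
    return corr

imgs = np.random.rand(200, 8, 8)
ys = np.sign(imgs[:, :4, :4].mean(axis=(1, 2)) - imgs.mean())  # label tied to a local region
print(local_correlation_map(imgs, ys).round(2))
```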
no code implementations • 11 Jul 2019 • Alon Brutzkus, Amit Daniely, Eran Malach
Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees.
no code implementations • 20 Jun 2019 • Alon Brutzkus, Amit Daniely, Eran Malach
In recent years, there have been many attempts to understand popular heuristics.
no code implementations • ICLR 2019 • Jonathan Fiat, Eran Malach, Shai Shalev-Shwartz
Specifically, we show a memorization result for networks of size $\tilde{\Omega}(\frac{m}{d})$, and improved generalization bounds.
1 code implementation • NeurIPS 2019 • Eran Malach, Shai Shalev-Shwartz
Using this result we prove that, at least in some distributions, the success of learning deep networks depends on whether the distribution can be well approximated by shallower networks, and we conjecture that this property holds in general.
no code implementations • 26 Mar 2018 • Eran Malach, Shai Shalev-Shwartz
We describe a layer-by-layer algorithm for training deep convolutional networks, where each step involves gradient updates for a two-layer network followed by a simple clustering algorithm.
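The sketch below gives a schematic, fully connected rendition of one such stage (gradient updates on a small two-layer block, then k-means clustering of its hidden features to form the next representation); the widths, cluster count, and loss are assumptions for illustration, and the paper's actual algorithm operates on convolutional patches.

```python
import torch
import torch.nn as nn

def train_block_then_cluster(x, y, width=64, n_clusters=32, steps=200, lr=1e-2):
    """Schematic sketch of one layer-wise stage (not the paper's exact
    procedure): fit a two-layer block with gradient steps, then quantize its
    hidden features with k-means so the cluster assignments become the input
    representation for the next stage.
    """
    d = x.shape[1]
    block = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(block.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):                      # gradient updates on the two-layer block
        opt.zero_grad()
        loss_fn(block(x).squeeze(-1), y).backward()
        opt.step()

    with torch.no_grad():
        h = torch.relu(block[0](x))             # hidden features of the trained block

        # Plain k-means (Lloyd's algorithm, a few iterations) on the features.
        centers = h[torch.randperm(h.shape[0])[:n_clusters]].clone()
        for _ in range(10):
            assign = torch.cdist(h, centers).argmin(dim=1)
            for c in range(n_clusters):
                mask = assign == c
                if mask.any():
                    centers[c] = h[mask].mean(dim=0)

        # One-hot cluster assignments serve as the next layer's representation.
        next_repr = torch.nn.functional.one_hot(assign, n_clusters).float()
    return block, next_repr

x = torch.randn(512, 16)
y = (x[:, 0] * x[:, 1] > 0).float()             # toy binary labels for illustration
_, next_repr = train_block_then_cluster(x, y)
print(next_repr.shape)
```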
no code implementations • ICLR 2018 • Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz
Neural networks exhibit good generalization behavior in the over-parameterized regime, where the number of network parameters exceeds the number of observations.
1 code implementation • NeurIPS 2017 • Eran Malach, Shai Shalev-Shwartz
Unfortunately, this approach often leads to noisy labels.