no code implementations • Eran Malach, Shai Shalev-Shwartz

To show any positive theoretical results, one must make assumptions on the data distribution.

no code implementations • 3 Jul 2024 • Kaiying Hou, David Brandfonbrener, Sham Kakade, Samy Jelassi, Eran Malach

Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models.

no code implementations • 25 Jun 2024 • Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson

Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community.
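Shampoo's Kronecker-factored preconditioning can be sketched for a single matrix parameter as follows. This is a minimal illustration, not the full algorithm: the `matrix_inv_root` helper, the step size, and the toy objective are our own assumptions.

```python
import numpy as np

def matrix_inv_root(M, p, eps=1e-6):
    """M^{-1/p} for a symmetric PSD matrix, via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    w = np.maximum(w, eps)               # guard against tiny/negative eigenvalues
    return (V * w ** (-1.0 / p)) @ V.T

def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo update for a matrix parameter W with gradient G."""
    L += G @ G.T                          # left Kronecker factor
    R += G.T @ G                          # right Kronecker factor
    W -= lr * matrix_inv_root(L, 4) @ G @ matrix_inv_root(R, 4)
    return W, L, R

# toy usage: minimize ||W||_F^2, whose gradient is 2W
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
L, R = np.eye(4), np.eye(3)
norm0 = np.linalg.norm(W)
for _ in range(100):
    W, L, R = shampoo_step(W, 2 * W, L, R)
```

Because both preconditioner factors are PSD, each step is a descent step on the toy objective, and the Frobenius norm of `W` shrinks monotonically.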

no code implementations • 17 Jun 2024 • Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach

Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on.

no code implementations • 16 Feb 2024 • Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, Nikolaos Tsilivis

We examine how learning is affected by varying the prior distribution over Markov chains, and consider the generalization of our in-context learning of Markov chains (ICL-MC) task to $n$-grams for $n > 2$.
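A rough sketch of the ICL-MC setup, as an illustration of the task: each sequence is sampled from a fresh Markov chain whose transition matrix is drawn from a prior (a Dirichlet prior here, one assumption among those the paper varies), and an in-context learner should approach the bigram-statistics estimator.

```python
import numpy as np

def sample_icl_mc_sequence(rng, n_states=3, length=50, alpha=1.0):
    """Draw a transition matrix from a Dirichlet prior, then a sequence."""
    P = rng.dirichlet([alpha] * n_states, size=n_states)
    seq = [rng.integers(n_states)]
    for _ in range(length - 1):
        seq.append(rng.choice(n_states, p=P[seq[-1]]))
    return np.array(seq), P

def bigram_predictor(seq, n_states=3):
    """In-context bigram statistics (add-one smoothing): the estimator
    a trained model should approach on this task."""
    counts = np.ones((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
seq, P = sample_icl_mc_sequence(rng, length=500)
P_hat = bigram_predictor(seq)
```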

2 code implementations • 1 Feb 2024 • Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach

Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context.
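A toy version of such a copying task might look like this; the exact task format in the paper may differ, so this construction is an assumption: the prompt is a random token string followed by a separator and its repetition, framed as next-token prediction.

```python
import numpy as np

def make_copy_example(rng, vocab=10, length=8, sep=10):
    """Build one next-token-prediction example for the copy task."""
    s = rng.integers(vocab, size=length)
    x = np.concatenate([s, [sep], s])      # prompt: s <sep> s
    inputs, targets = x[:-1], x[1:]        # shift by one for prediction
    return inputs, targets

rng = np.random.default_rng(0)
inp, tgt = make_copy_example(rng)
```

After the separator, every target token is determined by the prefix, which is exactly the retrieval behavior the task probes.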

1 code implementation • 13 Sep 2023 • Eran Malach

Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks.

no code implementations • 7 Sep 2023 • Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang

Finally, we show that the synthetic sparse parity task can be useful as a proxy for real problems requiring axis-aligned feature learning.
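The sparse parity task itself is easy to state; here is a minimal data generator (our own sketch): the label is the XOR of a hidden subset of k out of n input bits, so learning it requires discovering the relevant axis-aligned coordinates.

```python
import numpy as np

def sparse_parity_data(rng, n=20, k=3, m=1000):
    """Sample m uniform n-bit inputs labeled by a hidden k-sparse parity."""
    support = rng.choice(n, size=k, replace=False)   # hidden relevant bits
    X = rng.integers(0, 2, size=(m, n))
    y = X[:, support].sum(axis=1) % 2                # parity of the k bits
    return X, y, support

rng = np.random.default_rng(0)
X, y, support = sparse_parity_data(rng)
```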

no code implementations • 4 Sep 2023 • Etay Livne, Gal Kaplun, Eran Malach, Shai Shalev-Shwartz

However, for large datasets stored in the cloud, random access to individual examples is often costly and inefficient.

1 code implementation • 13 Feb 2023 • Gal Kaplun, Andrey Gurevich, Tal Swisa, Mazor David, Shai Shalev-Shwartz, Eran Malach

Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance.

no code implementations • 18 Jul 2022 • Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang

There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times.

no code implementations • 28 Mar 2022 • Gal Kaplun, Eran Malach, Preetum Nakkiran, Shai Shalev-Shwartz

We relate the notion of such samplers to knowledge distillation, where a student network imitates the outputs of a teacher on unlabeled data.
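The distillation objective behind this student–teacher setup can be sketched as a KL divergence between temperature-softened outputs. This is the standard formulation, assumed here rather than taken from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Mean KL(teacher || student) on temperature-softened outputs."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))

teacher = np.array([[2.0, 0.5, -1.0]])
uniform = np.zeros((1, 3))
loss_same = distillation_loss(teacher, teacher)   # student matches teacher
loss_diff = distillation_loss(uniform, teacher)   # student is uniform
```

When the student reproduces the teacher exactly, the loss vanishes; any mismatch yields a strictly positive KL.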

no code implementations • 29 Sep 2021 • Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz

Convolutional networks (CNNs) are computationally hard to learn.

no code implementations • NeurIPS 2021 • Emmanuel Abbe, Pritish Kamath, Eran Malach, Colin Sandon, Nathan Srebro

With fine enough precision relative to minibatch size, namely when $b \rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$.

no code implementations • 1 Mar 2021 • Eran Malach, Pritish Kamath, Emmanuel Abbe, Nathan Srebro

Complementing this, we show that without these conditions, gradient descent can in fact learn with small error even when no kernel method, in particular using the tangent kernel, can achieve a non-trivial advantage over random guessing.

no code implementations • 31 Jan 2021 • Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks.

no code implementations • NeurIPS 2020 • Eran Malach, Shai Shalev-Shwartz

In fact, the proofs of such hardness results show that even weakly learning deep networks is hard.

no code implementations • ICLR 2021 • Eran Malach, Shai Shalev-Shwartz

Convolutional neural networks (CNNs) exhibit unmatched performance in a multitude of computer vision tasks.

no code implementations • 18 Aug 2020 • Eran Malach, Shai Shalev-Shwartz

A supervised learning algorithm has access to a distribution of labeled examples, and needs to return a function (hypothesis) that correctly labels the examples.

no code implementations • NeurIPS 2020 • Amit Daniely, Eran Malach

On the other hand, under the same distributions, these parities cannot be learned efficiently by linear methods.

no code implementations • ICML 2020 • Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

The lottery ticket hypothesis (Frankle and Carbin, 2018) states that a randomly initialized network contains a small subnetwork that, when trained in isolation, can compete with the performance of the original network.

no code implementations • 25 Oct 2019 • Eran Malach, Shai Shalev-Shwartz

To separate hard from easy to learn distributions, we observe the property of local correlation: correlation between local patterns of the input and the target label.
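As an illustrative sketch (our own), local correlation can be estimated for binary inputs by correlating each individual coordinate — the simplest "local pattern" — with the label:

```python
import numpy as np

def local_correlations(X, y):
    """Pearson correlation between each input coordinate and the label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = X.std(axis=0) * y.std() * len(y)
    return (Xc * yc[:, None]).sum(axis=0) / denom

# toy data: the label equals the first bit, so coordinate 0 has
# correlation 1 while the remaining coordinates carry almost none
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 10)).astype(float)
y = X[:, 0]
corr = local_correlations(X, y)
```

Distributions where such per-pattern correlations are bounded away from zero are the "easy" ones in this framing; when every local pattern is uncorrelated with the label, local learning gets no signal.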

no code implementations • 11 Jul 2019 • Alon Brutzkus, Amit Daniely, Eran Malach

Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees.
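The core step of ID3 — choosing the split with maximal information gain — can be sketched as follows (a minimal version for binary features, our own illustration):

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_id3_split(X, y):
    """Return the index of the binary feature with highest information gain."""
    gains = []
    for j in range(X.shape[1]):
        mask = X[:, j] == 1
        cond = (mask.mean() * entropy(y[mask])
                + (1 - mask.mean()) * entropy(y[~mask]))
        gains.append(entropy(y) - cond)   # gain = H(y) - H(y | feature j)
    return int(np.argmax(gains))

# toy data where feature 2 fully determines the label
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 2]
```

On this toy data the conditional entropy given feature 2 is zero, so ID3 selects it at the root.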

no code implementations • 20 Jun 2019 • Alon Brutzkus, Amit Daniely, Eran Malach

In recent years, there have been many attempts to understand popular heuristics.

no code implementations • ICLR 2019 • Jonathan Fiat, Eran Malach, Shai Shalev-Shwartz

Specifically, we show a memorization result for networks of size $\tilde{\Omega}(\frac{m}{d})$, and improved generalization bounds.

1 code implementation • NeurIPS 2019 • Eran Malach, Shai Shalev-Shwartz

Using this result we prove that, at least in some distributions, the success of learning deep networks depends on whether the distribution can be well approximated by shallower networks, and we conjecture that this property holds in general.

no code implementations • 26 Mar 2018 • Eran Malach, Shai Shalev-Shwartz

We describe a layer-by-layer algorithm for training deep convolutional networks, where each step involves gradient updates for a two layer network followed by a simple clustering algorithm.

no code implementations • ICLR 2018 • Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz

Neural networks exhibit good generalization behavior in the over-parameterized regime, where the number of network parameters exceeds the number of observations.

1 code implementation • NeurIPS 2017 • Eran Malach, Shai Shalev-Shwartz

Unfortunately, this approach often leads to noisy labels.
