Search Results for author: Eran Malach

Found 29 papers, 5 papers with code

Provable Guarantees on Learning Hierarchical Generative Models with Deep CNNs

no code implementations Eran Malach, Shai Shalev-Shwartz

To show any positive theoretical results, one must make assumptions on the data distribution.

Universal Length Generalization with Turing Programs

no code implementations 3 Jul 2024 Kaiying Hou, David Brandfonbrener, Sham Kakade, Samy Jelassi, Eran Malach

Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models.

A New Perspective on Shampoo's Preconditioner

no code implementations 25 Jun 2024 Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson

Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community.

Transcendence: Generative Models Can Outperform The Experts That Train Them

no code implementations 17 Jun 2024 Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach

Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on.

The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains

no code implementations 16 Feb 2024 Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, Nikolaos Tsilivis

We examine how learning is affected by varying the prior distribution over Markov chains, and consider the generalization of our in-context learning of Markov chains (ICL-MC) task to $n$-grams for $n > 2$.

In-Context Learning

Repeat After Me: Transformers are Better than State Space Models at Copying

2 code implementations 1 Feb 2024 Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach

Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context.

Auto-Regressive Next-Token Predictors are Universal Learners

1 code implementation 13 Sep 2023 Eran Malach

Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks.

Mathematical Reasoning, Text Generation

Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck

no code implementations 7 Sep 2023 Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang

Finally, we show that the synthetic sparse parity task can be useful as a proxy for real problems requiring axis-aligned feature learning.

tabular-classification

Corgi^2: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD

no code implementations 4 Sep 2023 Etay Livne, Gal Kaplun, Eran Malach, Shai Shalev-Shwartz

However, for large datasets stored in the cloud, random access to individual examples is often costly and inefficient.

Less is More: Selective Layer Finetuning with SubTuning

1 code implementation 13 Feb 2023 Gal Kaplun, Andrey Gurevich, Tal Swisa, Mazor David, Shai Shalev-Shwartz, Eran Malach

Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance.

Multi-Task Learning

Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit

no code implementations 18 Jul 2022 Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang

There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times.

Knowledge Distillation: Bad Models Can Be Good Role Models

no code implementations 28 Mar 2022 Gal Kaplun, Eran Malach, Preetum Nakkiran, Shai Shalev-Shwartz

We relate the notion of such samplers to knowledge distillation, where a student network imitates the outputs of a teacher on unlabeled data.

Knowledge Distillation, Learning Theory

On the Power of Differentiable Learning versus PAC and SQ Learning

no code implementations NeurIPS 2021 Emmanuel Abbe, Pritish Kamath, Eran Malach, Colin Sandon, Nathan Srebro

With fine enough precision relative to minibatch size, namely when $b \rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$.

PAC learning

Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels

no code implementations 1 Mar 2021 Eran Malach, Pritish Kamath, Emmanuel Abbe, Nathan Srebro

Complementing this, we show that without these conditions, gradient descent can in fact learn with small error even when no kernel method, in particular using the tangent kernel, can achieve a non-trivial advantage over random guessing.

The Connection Between Approximation, Depth Separation and Learnability in Neural Networks

no code implementations 31 Jan 2021 Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks.

The Implications of Local Correlation on Learning Some Deep Functions

no code implementations NeurIPS 2020 Eran Malach, Shai Shalev-Shwartz

In fact, the proofs of such hardness results show that even weakly learning deep networks is hard.

Computational Separation Between Convolutional and Fully-Connected Networks

no code implementations ICLR 2021 Eran Malach, Shai Shalev-Shwartz

Convolutional neural networks (CNNs) exhibit unmatched performance in a multitude of computer vision tasks.

When Hardness of Approximation Meets Hardness of Learning

no code implementations 18 Aug 2020 Eran Malach, Shai Shalev-Shwartz

A supervised learning algorithm has access to a distribution of labeled examples, and needs to return a function (hypothesis) that correctly labels the examples.

Learning Parities with Neural Networks

no code implementations NeurIPS 2020 Amit Daniely, Eran Malach

On the other hand, under the same distributions, these parities cannot be learned efficiently by linear methods.

Proving the Lottery Ticket Hypothesis: Pruning is All You Need

no code implementations ICML 2020 Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

The lottery ticket hypothesis (Frankle and Carbin, 2018) states that a randomly-initialized network contains a small subnetwork that, when trained in isolation, can compete with the performance of the original network.

Learning Boolean Circuits with Neural Networks

no code implementations 25 Oct 2019 Eran Malach, Shai Shalev-Shwartz

To separate hard-to-learn from easy-to-learn distributions, we observe the property of local correlation: correlation between local patterns of the input and the target label.

On the Optimality of Trees Generated by ID3

no code implementations 11 Jul 2019 Alon Brutzkus, Amit Daniely, Eran Malach

Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees.

ID3 Learns Juntas for Smoothed Product Distributions

no code implementations 20 Jun 2019 Alon Brutzkus, Amit Daniely, Eran Malach

In recent years, there have been many attempts to understand popular heuristics.

Decoupling Gating from Linearity

no code implementations ICLR 2019 Jonathan Fiat, Eran Malach, Shai Shalev-Shwartz

Specifically, we show a memorization result for networks of size $\tilde{\Omega}(\frac{m}{d})$, and improved generalization bounds.

Generalization Bounds, Memorization

Is Deeper Better only when Shallow is Good?

1 code implementation NeurIPS 2019 Eran Malach, Shai Shalev-Shwartz

Using this result we prove that, at least in some distributions, the success of learning deep networks depends on whether the distribution can be well approximated by shallower networks, and we conjecture that this property holds in general.

Learning Theory, Open-Ended Question Answering

A Provably Correct Algorithm for Deep Learning that Actually Works

no code implementations 26 Mar 2018 Eran Malach, Shai Shalev-Shwartz

We describe a layer-by-layer algorithm for training deep convolutional networks, where each step involves gradient updates for a two-layer network followed by a simple clustering algorithm.

Clustering

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

no code implementations ICLR 2018 Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz

Neural networks exhibit good generalization behavior in the over-parameterized regime, where the number of network parameters exceeds the number of observations.

Generalization Bounds
