Search Results for author: Eran Malach

Found 26 papers, 4 papers with code

Provable Guarantees on Learning Hierarchical Generative Models with Deep CNNs

no code implementations Eran Malach, Shai Shalev-Shwartz

To show any positive theoretical results, one must make assumptions on the data distribution.

The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains

no code implementations 16 Feb 2024 Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, Nikolaos Tsilivis

We examine how learning is affected by varying the prior distribution over Markov chains, and consider the generalization of our in-context learning of Markov chains (ICL-MC) task to $n$-grams for $n > 2$.

In-Context Learning
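
A minimal sketch of what an ICL-MC-style example might look like (an illustration, not the authors' code): each sequence is rolled out from a freshly sampled Markov chain, and the bigram baseline plays the role of a statistical induction head predicting from in-context bigram counts. The Dirichlet prior and the state/sequence sizes are placeholder choices.

```python
import numpy as np

def sample_icl_mc_sequence(num_states=3, seq_len=256, alpha=1.0, rng=None):
    """Sample one ICL-MC-style example: a fresh random Markov chain per sequence.

    Each row of the transition matrix is drawn from a Dirichlet prior, then a
    token sequence is rolled out from the chain. An in-context learner must
    infer the chain's statistics from the prefix it has seen."""
    rng = rng or np.random.default_rng()
    P = rng.dirichlet(alpha * np.ones(num_states), size=num_states)
    seq = np.empty(seq_len, dtype=np.int64)
    seq[0] = rng.integers(num_states)
    for t in range(1, seq_len):
        seq[t] = rng.choice(num_states, p=P[seq[t - 1]])
    return seq, P

def bigram_predictor(seq, num_states=3):
    """Statistical-induction-head baseline: predict the next token from
    empirical bigram counts over the prefix (add-one smoothing)."""
    counts = np.ones((num_states, num_states))
    for prev, nxt in zip(seq[:-1], seq[1:]):
        counts[prev, nxt] += 1
    return counts[seq[-1]] / counts[seq[-1]].sum()

seq, P = sample_icl_mc_sequence()
print(bigram_predictor(seq))   # estimated next-token distribution
print(P[seq[-1]])              # true next-token distribution
```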

Repeat After Me: Transformers are Better than State Space Models at Copying

1 code implementation 1 Feb 2024 Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach

Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context.
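
A hedged sketch of the kind of synthetic copying task the excerpt refers to (the exact format in the paper may differ): the model reads a random string followed by a separator token and is trained with a next-token loss to reproduce the string verbatim.

```python
import torch

def make_copy_batch(batch_size=32, vocab_size=26, copy_len=64):
    """Build a batch for a synthetic copying task.

    Token ids 0..vocab_size-1 are "characters"; vocab_size is the separator.
    Inputs:  [x_1 .. x_L, SEP, x_1 .. x_{L-1}]
    Targets: targets[:, t] is the label for the prediction made at position t;
    positions before SEP are ignored via the -100 label convention."""
    x = torch.randint(0, vocab_size, (batch_size, copy_len))
    sep = torch.full((batch_size, 1), vocab_size)
    inputs = torch.cat([x, sep, x[:, :-1]], dim=1)
    targets = torch.cat([torch.full((batch_size, copy_len), -100), x], dim=1)
    return inputs, targets

inputs, targets = make_copy_batch()
print(inputs.shape, targets.shape)   # torch.Size([32, 128]) torch.Size([32, 128])
```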

Auto-Regressive Next-Token Predictors are Universal Learners

no code implementations 13 Sep 2023 Eran Malach

Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks.

Mathematical Reasoning • Text Generation

Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck

no code implementations 7 Sep 2023 Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang

Finally, we show that the synthetic sparse parity task can be useful as a proxy for real problems requiring axis-aligned feature learning.

tabular-classification
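
For reference, the sparse parity task mentioned above is straightforward to generate; this is a generic construction, not code from the paper.

```python
import numpy as np

def sparse_parity_data(n_samples, n_bits=50, k=3, seed=0):
    """(n, k)-sparse parity: inputs are uniform ±1 vectors of length n_bits,
    and the label is the product (parity) of a fixed hidden subset of k bits.
    The subset is unknown to the learner."""
    rng = np.random.default_rng(seed)
    support = rng.choice(n_bits, size=k, replace=False)   # hidden relevant bits
    X = rng.choice([-1.0, 1.0], size=(n_samples, n_bits))
    y = X[:, support].prod(axis=1)
    return X, y, support

X, y, support = sparse_parity_data(10_000)
print(support, y[:5])
```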

Corgi^2: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD

no code implementations 4 Sep 2023 Etay Livne, Gal Kaplun, Eran Malach, Shai Shalev-Shwartz

However, for large datasets stored in the cloud, random access to individual examples is often costly and inefficient.
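
The cost of random access motivates hybrid shuffling schemes. Below is a generic shuffle-buffer sketch in that spirit; it is illustrative only and is not the Corgi^2 algorithm itself (`load_chunk` is a hypothetical callable returning a list of examples).

```python
import random

def storage_aware_stream(chunk_paths, load_chunk, buffer_size=4096, seed=0):
    """Illustrative storage-aware shuffling: chunks are read sequentially in a
    random order (cheap in object storage), and examples are emitted from a
    bounded in-memory shuffle buffer (approximate within-epoch shuffling)."""
    rng = random.Random(seed)
    order = list(chunk_paths)
    rng.shuffle(order)                          # coarse, chunk-level shuffle
    buffer = []
    for path in order:
        for example in load_chunk(path):        # cheap sequential read
            buffer.append(example)
            if len(buffer) >= buffer_size:      # fine, within-buffer shuffle
                yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))
```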

Less is More: Selective Layer Finetuning with SubTuning

1 code implementation 13 Feb 2023 Gal Kaplun, Andrey Gurevich, Tal Swisa, Mazor David, Shai Shalev-Shwartz, Eran Malach

Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance.

Multi-Task Learning
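
A minimal PyTorch sketch of selective layer finetuning in the spirit of SubTuning, assuming a recent torchvision; the layers chosen here are arbitrary, whereas the paper selects the subset with a greedy search over validation performance.

```python
import torch
from torchvision.models import resnet18

def subtune(model, trainable_substrings):
    """Freeze every parameter except those whose name contains one of the given
    substrings, and return the trainable parameters for the optimizer."""
    params = []
    for name, p in model.named_parameters():
        p.requires_grad = any(s in name for s in trainable_substrings)
        if p.requires_grad:
            params.append(p)
    return params

model = resnet18(weights="DEFAULT")
params = subtune(model, ["layer3", "fc"])    # finetune only block 3 and the head
optimizer = torch.optim.AdamW(params, lr=1e-4)
```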

Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit

no code implementations 18 Jul 2022 Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang

There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times.

Knowledge Distillation: Bad Models Can Be Good Role Models

no code implementations 28 Mar 2022 Gal Kaplun, Eran Malach, Preetum Nakkiran, Shai Shalev-Shwartz

We relate the notion of such samplers to knowledge distillation, where a student network imitates the outputs of a teacher on unlabeled data.

Knowledge Distillation • Learning Theory
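
The student-imitates-teacher setup described above corresponds to standard distillation on unlabeled data; a minimal sketch (generic, not the paper's experimental code):

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, unlabeled_x, optimizer, T=2.0):
    """One distillation step on unlabeled data: the student matches the
    teacher's temperature-softened output distribution via KL divergence.
    No ground-truth labels appear anywhere in this step."""
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(unlabeled_x) / T, dim=-1)
    student_log_probs = F.log_softmax(student(unlabeled_x) / T, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```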

On the Power of Differentiable Learning versus PAC and SQ Learning

no code implementations NeurIPS 2021 Emmanuel Abbe, Pritish Kamath, Eran Malach, Colin Sandon, Nathan Srebro

With fine enough precision relative to minibatch size, namely when $b \rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm, so its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$.

PAC learning

Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels

no code implementations 1 Mar 2021 Eran Malach, Pritish Kamath, Emmanuel Abbe, Nathan Srebro

Complementing this, we show that without these conditions, gradient descent can in fact learn with small error even when no kernel method, in particular using the tangent kernel, can achieve a non-trivial advantage over random guessing.

The Connection Between Approximation, Depth Separation and Learnability in Neural Networks

no code implementations 31 Jan 2021 Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks.

The Implications of Local Correlation on Learning Some Deep Functions

no code implementations NeurIPS 2020 Eran Malach, Shai Shalev-Shwartz

In fact, the proofs of such hardness results show that even weakly learning deep networks is hard.

Computational Separation Between Convolutional and Fully-Connected Networks

no code implementations ICLR 2021 Eran Malach, Shai Shalev-Shwartz

Convolutional neural networks (CNNs) exhibit unmatched performance in a multitude of computer vision tasks.

When Hardness of Approximation Meets Hardness of Learning

no code implementations 18 Aug 2020 Eran Malach, Shai Shalev-Shwartz

A supervised learning algorithm has access to a distribution of labeled examples, and needs to return a function (hypothesis) that correctly labels the examples.

Learning Parities with Neural Networks

no code implementations NeurIPS 2020 Amit Daniely, Eran Malach

On the other hand, under the same distributions, these parities cannot be learned efficiently by linear methods.

Proving the Lottery Ticket Hypothesis: Pruning is All You Need

no code implementations ICML 2020 Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

The lottery ticket hypothesis (Frankle and Carbin, 2018) states that a randomly-initialized network contains a small subnetwork that, when trained in isolation, can compete with the performance of the original network.
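
For context, the hypothesis refers to the iterative magnitude pruning procedure of Frankle and Carbin; the sketch below follows that rewind-and-prune procedure, not the pruning-only construction proved in this paper. `make_model` and `train_fn` are hypothetical callables.

```python
import copy
import torch

def find_winning_ticket(make_model, train_fn, prune_frac=0.2, rounds=5):
    """Iterative magnitude pruning with rewinding: repeatedly train from the
    original initialization under the current mask, then prune the smallest
    surviving weights. Returns the initialization and the final masks."""
    model = make_model()
    init_state = copy.deepcopy(model.state_dict())       # theta_0, kept for rewinding
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        model.load_state_dict(init_state)                # rewind weights to init
        for n, p in model.named_parameters():
            p.data *= masks[n]                           # zero out pruned weights
        train_fn(model)                                  # (a full implementation would
                                                         #  re-apply the mask each step)
        for n, p in model.named_parameters():            # prune the smallest survivors
            alive = p.data[masks[n].bool()].abs()
            if alive.numel() == 0:
                continue
            masks[n] *= (p.data.abs() > alive.quantile(prune_frac)).float()
    return init_state, masks
```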

Learning Boolean Circuits with Neural Networks

no code implementations 25 Oct 2019 Eran Malach, Shai Shalev-Shwartz

To separate hard from easy to learn distributions, we observe the property of local correlation: correlation between local patterns of the input and the target label.
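
A toy illustration of measuring correlation between a local input pattern and the label; the local pattern here is simply the parity of a small patch of coordinates, a simplified stand-in for the paper's formal definition.

```python
import numpy as np

def local_correlation(X, y, patch):
    """Empirical correlation between a local pattern of the input and the
    label; the local pattern used here is the parity of the coordinates in
    `patch` (a slice)."""
    local_pattern = X[:, patch].prod(axis=1)   # ±1-valued local feature
    return np.mean(local_pattern * y)          # empirical correlation with the label

# Toy check: a label driven by a local patch has high local correlation there.
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(10_000, 32))
y = X[:, :3].prod(axis=1)                      # label depends only on bits 0..2
print(local_correlation(X, y, slice(0, 3)))    # exactly 1.0
print(local_correlation(X, y, slice(10, 13)))  # close to 0.0
```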

On the Optimality of Trees Generated by ID3

no code implementations 11 Jul 2019 Alon Brutzkus, Amit Daniely, Eran Malach

Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees.
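
For reference, the core step ID3 repeats recursively is an information-gain split; a self-contained sketch for binary features and labels:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a {0,1} label vector."""
    p = np.mean(y)
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def id3_split(X, y):
    """Pick the feature whose split maximizes information gain (entropy
    reduction) on the current node's data. X is a {0,1} matrix."""
    base = entropy(y)
    gains = []
    for j in range(X.shape[1]):
        mask = X[:, j] == 1
        if mask.all() or (~mask).all():
            gains.append(0.0)
            continue
        cond = mask.mean() * entropy(y[mask]) + (~mask).mean() * entropy(y[~mask])
        gains.append(base - cond)
    return int(np.argmax(gains)), gains

# Toy check: the label equals feature 2, so ID3 should pick feature 2 first.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 5))
y = X[:, 2]
print(id3_split(X, y)[0])   # -> 2
```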

ID3 Learns Juntas for Smoothed Product Distributions

no code implementations 20 Jun 2019 Alon Brutzkus, Amit Daniely, Eran Malach

In recent years, there have been many attempts to understand popular heuristics.

Decoupling Gating from Linearity

no code implementations ICLR 2019 Jonathan Fiat, Eran Malach, Shai Shalev-Shwartz

Specifically, we show a memorization result for networks of size $\tilde{\Omega}(\frac{m}{d})$, and improved generalization bounds.

Generalization Bounds • Memorization

Is Deeper Better only when Shallow is Good?

1 code implementation NeurIPS 2019 Eran Malach, Shai Shalev-Shwartz

Using this result we prove that, at least for some distributions, the success of learning deep networks depends on whether the distribution can be well approximated by shallower networks, and we conjecture that this property holds in general.

Learning Theory • Open-Ended Question Answering

A Provably Correct Algorithm for Deep Learning that Actually Works

no code implementations 26 Mar 2018 Eran Malach, Shai Shalev-Shwartz

We describe a layer-by-layer algorithm for training deep convolutional networks, where each step involves gradient updates for a two layer network followed by a simple clustering algorithm.

Clustering
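
One possible reading of the layer-by-layer scheme described above: an illustrative sketch only, not the authors' algorithm, assuming scikit-learn for the clustering step. Each stage fits a small two-layer network on the current features, then clusters its hidden activations to build the next representation.

```python
import torch
from sklearn.cluster import KMeans

def train_two_layer(X, y, width=128, steps=200, lr=1e-2):
    """Gradient updates for a small two-layer network on the current features
    (the 'gradient step' half of a stage). X is a float tensor, y a float
    tensor of 0/1 labels."""
    net = torch.nn.Sequential(
        torch.nn.Linear(X.shape[1], width), torch.nn.ReLU(),
        torch.nn.Linear(width, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            net(X).squeeze(-1), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net

def layerwise_train(X, y, depth=3, n_clusters=64):
    """Illustrative layer-by-layer loop: at each depth, fit a two-layer network,
    then cluster its hidden activations to form the next representation
    (one-hot cluster membership)."""
    feats = X
    for _ in range(depth):
        net = train_two_layer(feats, y)
        hidden = torch.relu(net[0](feats)).detach()
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit(hidden.numpy()).labels_
        feats = torch.nn.functional.one_hot(
            torch.from_numpy(labels).long(), n_clusters).float()
    return feats
```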

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

no code implementations ICLR 2018 Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz

Neural networks exhibit good generalization behavior in the over-parameterized regime, where the number of network parameters exceeds the number of observations.

Generalization Bounds
