Search Results for author: Etai Littwin

Found 23 papers, 4 papers with code

Stabilizing Transformer Training by Preventing Attention Entropy Collapse

1 code implementation 11 Mar 2023 Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Josh Susskind

We show that $\sigma$Reparam provides stability and robustness with respect to the choice of hyperparameters, going so far as enabling training (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization or adaptive optimizers; (b) deep architectures in machine translation; and (c) speech recognition to competitive performance without warmup and adaptive optimizers.

Automatic Speech Recognition, Image Classification, +6
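
A minimal sketch of the idea behind $\sigma$Reparam as described in this entry: reparametrize each weight matrix by its spectral norm with a learnable scale, estimated here by power iteration. The layer name, initialization, and power-iteration schedule are illustrative assumptions, not the paper's exact recipe.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SigmaReparamLinear(nn.Module):
        """Linear layer whose effective weight is gamma / sigma(W) * W (sketch only)."""
        def __init__(self, d_in, d_out, n_power_iters=1):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
            self.gamma = nn.Parameter(torch.ones(()))                 # learnable scale
            self.register_buffer("u", F.normalize(torch.randn(d_out), dim=0))
            self.n_power_iters = n_power_iters

        def forward(self, x):
            w = self.weight
            with torch.no_grad():                                     # power iteration
                u = self.u
                for _ in range(self.n_power_iters):
                    v = F.normalize(w.t() @ u, dim=0)
                    u = F.normalize(w @ v, dim=0)
                self.u.copy_(u)
            sigma = torch.dot(u, w @ v)                               # spectral-norm estimate
            return x @ (self.gamma / sigma * w).t()                   # reparametrized weight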

When can transformers reason with abstract symbols?

1 code implementation 15 Oct 2023 Enric Boix-Adsera, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, Joshua Susskind

We investigate the capabilities of transformer large language models (LLMs) on relational reasoning tasks involving abstract symbols.

Relational Reasoning

Vanishing Gradients in Reinforcement Finetuning of Language Models

1 code implementation 31 Oct 2023 Noam Razin, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, Etai Littwin

Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which refers to maximizing a (possibly learned) reward function using policy gradient algorithms.
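
For context, the policy-gradient objective referenced here has the standard score-function (REINFORCE) form, written in generic notation rather than the paper's:

$$\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\big].$$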

On Infinite-Width Hypernetworks

1 code implementation NeurIPS 2020 Etai Littwin, Tomer Galanti, Lior Wolf, Greg Yang

Hypernetworks are architectures that produce the weights of a task-specific primary network.

Meta-Learning
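
A minimal sketch of the hypernetwork setup described in this entry, where a small network emits the weights of a task-specific primary linear layer; the layer sizes and task-embedding input are illustrative assumptions.

    import torch
    import torch.nn as nn

    class HyperLinear(nn.Module):
        """A hypernetwork that generates the weights of a primary linear layer."""
        def __init__(self, task_dim, d_in, d_out, hidden=64):
            super().__init__()
            self.d_in, self.d_out = d_in, d_out
            self.hyper = nn.Sequential(                               # the hypernetwork
                nn.Linear(task_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, d_out * d_in + d_out),
            )

        def forward(self, x, task_emb):
            params = self.hyper(task_emb)                             # generated parameters
            w = params[: self.d_out * self.d_in].view(self.d_out, self.d_in)
            b = params[self.d_out * self.d_in:]
            return x @ w.t() + b                                      # primary-network forward

    layer = HyperLinear(task_dim=8, d_in=16, d_out=4)
    y = layer(torch.randn(32, 16), torch.randn(8))                    # y: (32, 4)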

The Loss Surface of Residual Networks: Ensembles and the Role of Batch Normalization

no code implementations 8 Nov 2016 Etai Littwin, Lior Wolf

Deep Residual Networks present a premium in performance in comparison to conventional networks of the same depth and are trainable at extreme depths.

The Multiverse Loss for Robust Transfer Learning

no code implementations CVPR 2016 Etai Littwin, Lior Wolf

Deep learning techniques are renowned for supporting effective transfer learning.

Transfer Learning

Regularizing by the Variance of the Activations' Sample-Variances

no code implementations NeurIPS 2018 Etai Littwin, Lior Wolf

Normalization techniques play an important role in supporting efficient and often more effective training of deep neural networks.
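
A rough sketch of a regularizer in the spirit of the title: estimate the activations' sample variance on several subsets of the batch and penalize how much those estimates fluctuate. The subset scheme and normalization here are assumptions; the paper's exact loss differs in its details.

    import torch

    def variance_constancy_penalty(activations, n_chunks=4):
        """Penalize the variance of per-subset sample variances of the activations."""
        chunks = activations.chunk(n_chunks, dim=0)                   # subsets of the batch
        sample_vars = torch.stack([c.var(dim=0) for c in chunks])     # per-unit sample variances
        return sample_vars.var(dim=0).mean()                          # variance of the variances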

Spherical Embedding of Inlier Silhouette Dissimilarities

no code implementations CVPR 2015 Etai Littwin, Hadar Averbuch-Elor, Daniel Cohen-Or

In this paper, we introduce a spherical embedding technique to position a given set of silhouettes of an object as observed from a set of cameras arbitrarily positioned around the object.

On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width

no code implementations ICML 2019 Deep Phenomena Workshop Etai Littwin, Lior Wolf

The Hessian of neural networks can be decomposed into a sum of two matrices: (i) the positive semidefinite generalized Gauss-Newton matrix G, and (ii) the matrix H containing negative eigenvalues.
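
For reference, a standard form of this decomposition, written for a loss $\ell$ applied to network outputs $f_i(\theta)$ (generic notation; the paper's may differ):

$$\nabla^2_\theta L \;=\; \underbrace{\sum_i J_i^\top \big(\nabla^2_f \ell\big) J_i}_{G \,\succeq\, 0} \;+\; \underbrace{\sum_i \sum_k \frac{\partial \ell}{\partial f_{i,k}}\, \nabla^2_\theta f_{i,k}}_{H}, \qquad J_i = \frac{\partial f_i}{\partial \theta}.$$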

On Random Kernels of Residual Architectures

no code implementations 28 Jan 2020 Etai Littwin, Tomer Galanti, Lior Wolf

We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
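
For reference, the finite-width (empirical) NTK whose corrections this paper studies can be computed directly from parameter gradients; this sketch assumes a small scalar-output network and is not the paper's code.

    import torch
    import torch.nn as nn

    def empirical_ntk(model, x1, x2):
        """K(x1, x2) = <df(x1)/dtheta, df(x2)/dtheta> for a scalar-output model."""
        def grads(x):
            out = model(x.unsqueeze(0)).squeeze()                     # scalar output
            g = torch.autograd.grad(out, list(model.parameters()))
            return torch.cat([gi.reshape(-1) for gi in g])
        return torch.dot(grads(x1), grads(x2))

    net = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 1))
    print(empirical_ntk(net, torch.randn(16), torch.randn(16)))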

Collegial Ensembles

no code implementations NeurIPS 2020 Etai Littwin, Ben Myara, Sima Sabah, Joshua Susskind, Shuangfei Zhai, Oren Golan

Modern neural network performance typically improves as model size increases.

Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics

no code implementations 8 May 2021 Greg Yang, Etai Littwin

To facilitate this proof, we develop a graphical notation for Tensor Programs.

Implicit Greedy Rank Learning in Autoencoders via Overparameterized Linear Networks

no code implementations 2 Jul 2021 Shih-Yu Sun, Vimal Thilak, Etai Littwin, Omid Saremi, Joshua M. Susskind

Deep linear networks trained with gradient descent yield low rank solutions, as is typically studied in matrix factorization.
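
An illustrative sketch (not the paper's exact setup): fit low-rank data with a two-layer linear autoencoder trained from a small initialization, then inspect the singular values of the learned end-to-end map.

    import torch

    torch.manual_seed(0)
    X = torch.randn(256, 4) @ torch.randn(4, 32)                      # rank-4 data
    W1 = (1e-2 * torch.randn(32, 32)).requires_grad_()                # small initialization
    W2 = (1e-2 * torch.randn(32, 32)).requires_grad_()
    opt = torch.optim.SGD([W1, W2], lr=0.1)
    for _ in range(2000):
        opt.zero_grad()
        loss = ((X @ W1.t() @ W2.t() - X) ** 2).mean()                # linear autoencoder loss
        loss.backward()
        opt.step()
    print(torch.linalg.svdvals(W2 @ W1)[:8])                          # leading singular values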

The Effect of Residual Architecture on the Per-Layer Gradient of Deep Networks

no code implementations 25 Sep 2019 Etai Littwin, Lior Wolf

A critical part of the training process of neural networks takes place in the very first gradient steps post initialization.

Learning Representation from Neural Fisher Kernel with Low-rank Approximation

no code implementations ICLR 2022 Ruixiang Zhang, Shuangfei Zhai, Etai Littwin, Josh Susskind

We show that the low-rank approximation of NFKs derived from unsupervised generative models and supervised learning models gives rise to high-quality compact representations of data, achieving competitive results on a variety of machine learning tasks.

The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon

no code implementations 10 Jun 2022 Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, Joshua Susskind

While common and easily reproduced in more general settings, the Slingshot Mechanism does not follow from any known optimization theories that we are aware of, and can be easily overlooked without an in-depth examination.

Inductive Bias

Tight conditions for when the NTK approximation is valid

no code implementations 22 May 2023 Enric Boix-Adsera, Etai Littwin

We study when the neural tangent kernel (NTK) approximation is valid for training a model with the square loss.
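
For reference, the NTK approximation in question replaces the trained model by its first-order expansion around initialization (standard definitions, not specific to this paper):

$$f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0), \qquad K_{\mathrm{NTK}}(x, x') = \nabla_\theta f(x;\theta_0)^\top \nabla_\theta f(x';\theta_0).$$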

Transformers learn through gradual rank increase

no code implementations NeurIPS 2023 Enric Boix-Adsera, Etai Littwin, Emmanuel Abbe, Samy Bengio, Joshua Susskind

Our experiments support the theory and also show that the phenomenon can occur in practice without the simplifying assumptions.

Incremental Learning

Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit

no code implementations 3 Aug 2023 Greg Yang, Etai Littwin

Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam?

Adaptivity and Modularity for Efficient Generalization Over Task Complexity

no code implementations 13 Oct 2023 Samira Abnar, Omid Saremi, Laurent Dinh, Shantel Wilson, Miguel Angel Bautista, Chen Huang, Vimal Thilak, Etai Littwin, Jiatao Gu, Josh Susskind, Samy Bengio

We investigate how the use of a mechanism for adaptive and modular computation in transformers facilitates the learning of tasks that demand generalization over the number of sequential computation steps (i.e., the depth of the computation graph).

Retrieval

What Algorithms can Transformers Learn? A Study in Length Generalization

no code implementations 24 Oct 2023 Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, Preetum Nakkiran

Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity.

LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures

no code implementations 7 Dec 2023 Vimal Thilak, Chen Huang, Omid Saremi, Laurent Dinh, Hanlin Goh, Preetum Nakkiran, Joshua M. Susskind, Etai Littwin

In this paper, we introduce LiDAR (Linear Discriminant Analysis Rank), a metric designed to measure the quality of representations within JE architectures.
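
A rough sketch of an LDA-based effective-rank score in the spirit of LiDAR, computed from embeddings grouped into surrogate classes; the paper's exact estimator (for example, how classes are formed from clean samples and their augmented views, and how the scatter matrices are regularized) differs, so treat this only as an illustration.

    import torch

    def lda_effective_rank(feats, labels, eps=1e-4):
        """Effective rank (exp of eigenvalue entropy) of the LDA matrix S_w^{-1} S_b."""
        d = feats.shape[1]
        mu = feats.mean(0)
        S_b = torch.zeros(d, d)
        S_w = eps * torch.eye(d)                                      # regularized within-class scatter
        for c in labels.unique():
            fc = feats[labels == c]
            mc = fc.mean(0)
            S_b += len(fc) * torch.outer(mc - mu, mc - mu)
            S_w += (fc - mc).t() @ (fc - mc)
        S_b /= len(feats)
        S_w /= len(feats)
        evals = torch.linalg.eigvals(torch.linalg.solve(S_w, S_b)).real.clamp(min=1e-12)
        p = evals / evals.sum()
        return torch.exp(-(p * p.log()).sum())                        # higher = richer representation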
