1 code implementation • 28 Feb 2024 • Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
Public LLMs such as Llama 2-Chat have driven huge activity in LLM research.
1 code implementation • 28 Feb 2024 • Kaiyue Wen, Xingyu Dang, Kaifeng Lyu
This paper investigates the gap in representation power between Recurrent Neural Networks (RNNs) and Transformers in the context of solving algorithmic problems.
no code implementations • 8 Feb 2024 • Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, Sanjiv Kumar
RaPTr achieves better pre-training loss for BERT and UL2 language models while requiring 20-33% fewer FLOPs compared to standard training, and is competitive with or better than other efficient training methods.
1 code implementation • 30 Nov 2023 • Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon S. Du, Jason D. Lee, Wei Hu
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, reaching perfect training accuracy but near-random test accuracy; only after substantially longer training does it suddenly transition to perfect test accuracy.
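As a rough illustration of the setup described above (not the authors' code), the sketch below trains a small MLP on modular addition with a 50/50 train/test split and logs both accuracies over a long run; the modulus, width, optimizer, weight decay, and step counts are illustrative assumptions, and whether and when the memorize-then-generalize transition appears depends on these choices.

```python
# Illustrative grokking-style experiment: small MLP on (a + b) mod p with weight decay.
import torch, torch.nn as nn

p = 97                                    # modulus (an assumed value)
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2 :]

def one_hot(x):                           # encode (a, b) as a concatenated one-hot vector
    return torch.cat([nn.functional.one_hot(x[:, 0], p),
                      nn.functional.one_hot(x[:, 1], p)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(100_000):               # train long past perfect train accuracy (slow on CPU)
    loss = nn.functional.cross_entropy(model(one_hot(pairs[train_idx])), labels[train_idx])
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(one_hot(pairs[train_idx])).argmax(1) == labels[train_idx]).float().mean()
            test_acc = (model(one_hot(pairs[test_idx])).argmax(1) == labels[test_idx]).float().mean()
        print(f"step {step}: train_acc={train_acc:.2f} test_acc={test_acc:.2f}")
```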
1 code implementation • 22 Oct 2023 • Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang
In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models.
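A minimal single-process simulation of the baseline being referred to, under illustrative assumptions (number of workers, problem size, learning rate): every worker computes a gradient on its own data shard, and the gradients are averaged at every step, which is exactly the per-step synchronization whose communication cost the paper aims to reduce. A real system would perform this average with an all-reduce across nodes.

```python
# Toy data-parallel SGD on a least-squares problem: gradients are averaged every step.
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 4, 10, 1000                        # workers, dimension, samples per worker (assumed)
X = [rng.normal(size=(n, d)) for _ in range(K)]
w_true = rng.normal(size=d)
y = [x @ w_true + 0.1 * rng.normal(size=n) for x in X]

w, lr = np.zeros(d), 0.1
for step in range(200):
    local_grads = [2 * X[k].T @ (X[k] @ w - y[k]) / n for k in range(K)]  # per-worker gradient
    grad = np.mean(local_grads, axis=0)      # "all-reduce" every step = the communication cost
    w -= lr * grad
print("distance to w_true:", np.linalg.norm(w - w_true))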
no code implementations • 12 Oct 2023 • Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal
Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
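For context on the draft/target mechanism that DistillSpec builds on, here is a toy sketch of greedy speculative decoding; it is not the paper's algorithm, and both "models" are placeholder next-token functions over a tiny integer vocabulary. The draft model cheaply proposes a block of tokens, and the target model verifies them, keeping the longest agreeing prefix, so a well-aligned draft yields several tokens per expensive target pass.

```python
# Toy greedy speculative decoding with placeholder draft/target next-token functions.
def draft_next(context):    # cheap draft "model" (placeholder)
    return (sum(context) + 1) % 5

def target_next(context):   # expensive target "model" (placeholder that occasionally disagrees)
    return (sum(context) + 1) % 5 if len(context) % 7 else (sum(context) + 2) % 5

def speculative_decode(context, num_tokens, block=4):
    out, target_len = list(context), len(context) + num_tokens
    while len(out) < target_len:
        proposal = []
        for _ in range(block):                       # draft proposes `block` tokens cheaply
            proposal.append(draft_next(out + proposal))
        n_accept = 0                                  # target keeps the longest agreeing prefix
        while n_accept < block and target_next(out + proposal[:n_accept]) == proposal[n_accept]:
            n_accept += 1
        out += proposal[:n_accept]
        if len(out) < target_len:                     # target emits one token at the disagreement
            out.append(target_next(out))
    return out[:target_len]

print(speculative_decode([1, 2], num_tokens=10))
```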
no code implementations • 27 Jul 2023 • Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise.
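A minimal sketch of that classical noiseless setting (not taken from the paper): plain gradient descent versus heavy-ball momentum on a strongly convex quadratic, using the standard textbook step sizes; the condition number, dimension, and iteration count are illustrative assumptions.

```python
# GD vs. heavy-ball momentum on f(x) = 0.5 x^T H x with condition number kappa.
import numpy as np

kappa = 100.0                                   # condition number L / mu (mu = 1, assumed)
H = np.diag(np.linspace(1.0, kappa, 50))
x_gd = x_hb = x_prev = np.ones(50)

lr_gd = 1.0 / kappa                             # standard step size 1/L for GD
lr_hb = 4.0 / (1.0 + np.sqrt(kappa)) ** 2       # classical heavy-ball parameters
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2

for t in range(300):
    x_gd = x_gd - lr_gd * (H @ x_gd)                               # gradient descent
    x_new = x_hb - lr_hb * (H @ x_hb) + beta * (x_hb - x_prev)     # heavy-ball update
    x_prev, x_hb = x_hb, x_new

print("GD distance to optimum:", np.linalg.norm(x_gd))
print("HB distance to optimum:", np.linalg.norm(x_hb))             # much smaller after 300 steps
```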
1 code implementation • 2 Mar 2023 • Xinran Gu, Kaifeng Lyu, Longbo Huang, Sanjeev Arora
Local SGD is a communication-efficient variant of SGD for large-scale training, where multiple GPUs perform SGD independently and average the model parameters periodically.
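The sketch below is an illustrative single-process simulation of that scheme (not the paper's implementation), reusing a toy least-squares problem: K workers run SGD independently for H local steps, then average their parameters, so communication happens once every H steps instead of every step. K, H, batch size, and learning rate are assumed values.

```python
# Toy Local SGD: independent local steps followed by periodic parameter averaging.
import numpy as np

rng = np.random.default_rng(0)
K, H, d, n = 4, 8, 10, 1000                     # workers, local steps, dimension, samples/worker
X = [rng.normal(size=(n, d)) for _ in range(K)]
w_true = rng.normal(size=d)
y = [x @ w_true + 0.1 * rng.normal(size=n) for x in X]

w_global, lr, batch = np.zeros(d), 0.05, 32
for round_ in range(50):
    local_models = []
    for k in range(K):                          # each worker starts from the shared parameters
        w = w_global.copy()
        for _ in range(H):                      # H independent local SGD steps, no communication
            idx = rng.integers(0, n, size=batch)
            w -= lr * 2 * X[k][idx].T @ (X[k][idx] @ w - y[k][idx]) / batch
        local_models.append(w)
    w_global = np.mean(local_models, axis=0)    # periodic averaging = the only communication
print("distance to w_true:", np.linalg.norm(w_global - w_true))
```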
no code implementations • 27 Jan 2023 • Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S. Du, Jason D. Lee
It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models.
1 code implementation • 5 Nov 2022 • Arushi Gupta, Nikunj Saunshi, Dingli Yu, Kaifeng Lyu, Sanjeev Arora
Saliency methods compute heat maps that highlight portions of an input that were most important for the label assigned to it by a deep net.
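To make the object of study concrete, here is a small sketch of one common saliency method (vanilla input gradients); the network and "image" are random placeholders, and this is just an example of the kind of heat map such evaluations examine.

```python
# Vanilla gradient saliency: gradient of the top class score with respect to the input pixels.
import torch, torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 32 * 32, 10))
x = torch.randn(1, 3, 32, 32, requires_grad=True)    # placeholder "image"

score = model(x)[0].max()                   # score of the predicted class
score.backward()                            # backprop to the input
saliency = x.grad.abs().max(dim=1).values   # heat map: max |gradient| over channels -> (1, 32, 32)
print(saliency.shape)
```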
no code implementations • 14 Jun 2022 • Kaifeng Lyu, Zhiyuan Li, Sanjeev Arora
Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets.
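For reference, a minimal sketch of what the two normalization layers mentioned above compute on a batch of activations of shape (batch, features); the learnable scale and shift parameters are omitted for brevity.

```python
# Batch Normalization vs. Layer Normalization on a (batch, features) activation matrix.
import numpy as np

def batch_norm(h, eps=1e-5):   # normalize each feature across the batch dimension
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def layer_norm(h, eps=1e-5):   # normalize each example across its features
    return (h - h.mean(axis=1, keepdims=True)) / np.sqrt(h.var(axis=1, keepdims=True) + eps)

h = np.random.randn(4, 8) * 3.0 + 1.0
print(batch_norm(h).std(axis=0))   # roughly 1 per feature
print(layer_norm(h).std(axis=1))   # roughly 1 per example
```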
1 code implementation • 20 May 2022 • Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD.
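A tiny numerical illustration of that correspondence, under assumed values of the learning rate, noise scale, and objective (not the paper's derivation): SGD on a 1-d quadratic with Gaussian gradient noise is compared against an Euler-Maruyama simulation of the matching SDE dX = -X dt + sqrt(lr)*sigma dW run at a finer time step, and both fluctuate around the minimum with roughly the same variance.

```python
# SGD with gradient noise vs. Euler-Maruyama simulation of the corresponding SDE.
import numpy as np

rng = np.random.default_rng(0)
lr, sigma, T = 0.1, 0.5, 1000.0               # learning rate, noise scale, total "SDE time"

x, xs = 2.0, []                               # SGD: one step corresponds to time lr
for _ in range(int(T / lr)):
    x -= lr * (x + sigma * rng.normal())      # stochastic gradient of f(x) = x^2 / 2
    xs.append(x)

x, dt, ys = 2.0, lr / 10, []                  # SDE simulated with a 10x finer time step
for _ in range(int(T / dt)):
    x += -x * dt + np.sqrt(lr) * sigma * np.sqrt(dt) * rng.normal()
    ys.append(x)

# Both stationary variances should be close to lr * sigma^2 / 2
print(np.var(xs[1000:]), np.var(ys[10000:]), lr * sigma**2 / 2)
```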
no code implementations • NeurIPS 2021 • Kaifeng Lyu, Zhiyuan Li, Runzhe Wang, Sanjeev Arora
The current paper establishes this global optimality for two-layer Leaky ReLU nets trained with gradient flow on linearly separable and symmetric data, regardless of the width.
no code implementations • 29 Sep 2021 • Arushi Gupta, Nikunj Saunshi, Dingli Yu, Kaifeng Lyu, Sanjeev Arora
Saliency methods seek to provide human-interpretable explanations for the output of a machine learning model on a given input.
no code implementations • ICLR 2021 • Zhiyuan Li, Yuping Luo, Kaifeng Lyu
Matrix factorization is a simple and natural test-bed to investigate the implicit regularization of gradient descent.
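A small illustrative experiment in that test-bed (problem sizes, initialization scale, and step size are assumptions): gradient descent on an over-parameterized factorization U V^T, fit to a subset of observed entries of a rank-2 matrix and started from a small initialization, tends toward a solution whose spectrum is dominated by the first two singular values.

```python
# Gradient descent on matrix completion via an over-parameterized factorization U V^T.
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 2
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))     # rank-2 ground truth
mask = rng.random((n, n)) < 0.5                            # observe roughly half of the entries

U, V = 0.01 * rng.normal(size=(n, n)), 0.01 * rng.normal(size=(n, n))  # small initialization
for _ in range(5000):
    R = mask * (U @ V.T - M)                               # residual on observed entries only
    U, V = U - 0.01 * (R @ V), V - 0.01 * (R.T @ U)        # gradient step on both factors

# The leading two singular values should dominate the rest.
print("singular values of U V^T:", np.round(np.linalg.svd(U @ V.T, compute_uv=False)[:5], 2))
```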
no code implementations • NeurIPS 2020 • Zhiyuan Li, Kaifeng Lyu, Sanjeev Arora
Recent works (e.g., Li and Arora, 2020) suggest that the use of popular normalization schemes (including Batch Normalization) in today's deep learning can move it far from a traditional optimization viewpoint, e.g., the use of exponentially increasing learning rates.
1 code implementation • ICLR 2020 • Kaifeng Lyu, Jian Li
In this paper, we study the implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations.
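As a tiny illustration of this kind of implicit regularization, restricted to the linear special case rather than the paper's general homogeneous setting: gradient descent on the logistic loss over linearly separable data grows the weight norm without bound while the normalized margin keeps improving. The data construction, learning rate, and step counts below are assumptions.

```python
# Implicit margin improvement of gradient descent on logistic loss (linear, separable data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X[:, 0] = np.abs(X[:, 0]) + 1.0            # ensure separability along the first axis
X = np.vstack([X, -X])                      # mirrored negative class
y = np.array([1.0] * 50 + [-1.0] * 50)

w = np.zeros(2)
for t in range(1, 50001):
    margins = y * (X @ w)
    grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0)  # logistic-loss gradient
    w -= 0.1 * grad
    if t % 10000 == 0:
        print(f"step {t}: ||w|| = {np.linalg.norm(w):.2f}, "
              f"normalized margin = {margins.min() / np.linalg.norm(w):.4f}")
```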
no code implementations • ICLR 2019 • Sanjeev Arora, Zhiyuan Li, Kaifeng Lyu
Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization.