Search Results for author: Kaifeng Lyu

Found 18 papers, 8 papers with code

RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval

1 code implementation 28 Feb 2024 Kaiyue Wen, Xingyu Dang, Kaifeng Lyu

This paper investigates the gap in representation power between Recurrent Neural Networks (RNNs) and Transformers in the context of solving algorithmic problems.

Retrieval

Efficient Stagewise Pretraining via Progressive Subnetworks

no code implementations 8 Feb 2024 Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, Sanjiv Kumar

RaPTr achieves better pre-training loss for BERT and UL2 language models while requiring 20-33% fewer FLOPs than standard training, and is competitive with or better than other efficient training methods.

Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking

1 code implementation 30 Nov 2023 Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon S. Du, Jason D. Lee, Wei Hu

Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently long, it suddenly transitions to perfect test accuracy.

A Quadratic Synchronization Rule for Distributed Deep Learning

1 code implementation 22 Oct 2023 Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang

In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models.
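For context, the per-step synchronization that causes this overhead is standard data-parallel SGD (generic formulation, not the paper's rule): with K workers and local gradients g_k, every step performs an all-reduce to compute

\theta_{t+1} = \theta_t - \frac{\eta}{K}\sum_{k=1}^{K} g_k(\theta_t).

Local-SGD-style methods cut this cost by synchronizing only every H steps; as its name suggests, the paper's Quadratic Synchronization Rule ties H to the learning rate \eta, growing H as \eta decays.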

DistillSpec: Improving Speculative Decoding via Knowledge Distillation

no code implementations 12 Oct 2023 Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal

Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
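As background on the speculative decoding that DistillSpec speeds up, here is a minimal sketch of the standard accept/reject step from the speculative sampling literature (not code from this paper; p_target and q_draft are placeholder per-token distributions of the target and draft models):

import numpy as np

def accept_or_resample(token, p_target, q_draft, rng):
    # Accept the draft model's token with probability min(1, p/q); a well-aligned
    # draft model (the goal of DistillSpec) makes this acceptance rate high.
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token
    # On rejection, resample from the residual distribution max(p - q, 0),
    # which keeps the overall output distribution equal to the target model's.
    residual = np.maximum(p_target - q_draft, 0.0)
    return rng.choice(len(p_target), p=residual / residual.sum())

Higher acceptance rates let a single target-model forward pass verify longer draft blocks, which is where the reported latency gains come from.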

Knowledge Distillation, Language Modelling, +1

The Marginal Value of Momentum for Small Learning Rate SGD

no code implementations 27 Jul 2023 Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li

Momentum is known to accelerate the convergence of gradient descent in strongly convex settings when there is no stochastic gradient noise.
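For reference, the heavy-ball form of SGD with momentum that such analyses consider is usually written as (standard formulation, not notation taken from this paper)

v_{t+1} = \beta v_t + g_t, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1},

and in the small-learning-rate regime it is commonly compared against plain SGD run with the effective step size \eta / (1 - \beta).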

Stochastic Optimization

Why (and When) does Local SGD Generalize Better than SGD?

1 code implementation 2 Mar 2023 Xinran Gu, Kaifeng Lyu, Longbo Huang, Sanjeev Arora

Local SGD is a communication-efficient variant of SGD for large-scale training, where multiple GPUs perform SGD independently and average the model parameters periodically.
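The mechanism described above fits in a few lines; here is a toy sketch on a quadratic loss (illustrative only, not the paper's experimental setup; K, H, and the noise scale are placeholder values):

import numpy as np

def local_sgd(K=4, H=8, rounds=50, lr=0.1, dim=10, seed=0):
    # K workers each take H independent SGD steps on noisy gradients of
    # 0.5 * ||theta||^2, then all workers average their parameters.
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(rounds):
        local = [theta.copy() for _ in range(K)]
        for k in range(K):
            for _ in range(H):  # H local steps with no communication
                grad = local[k] + 0.1 * rng.standard_normal(dim)
                local[k] -= lr * grad
        theta = np.mean(local, axis=0)  # periodic parameter averaging
    return theta

Setting H = 1 recovers ordinary data-parallel SGD with per-step synchronization; larger H trades communication for drift between workers.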

Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing

no code implementations 27 Jan 2023 Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S. Du, Jason D. Lee

It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models.

Incremental Learning

New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound

1 code implementation 5 Nov 2022 Arushi Gupta, Nikunj Saunshi, Dingli Yu, Kaifeng Lyu, Sanjeev Arora

Saliency methods compute heat maps that highlight portions of an input that were most important for the label assigned to it by a deep net.
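To make "heat map" concrete, here is a minimal input-gradient saliency sketch (a generic baseline method, not the definitions or evaluations proposed in this paper; the tiny model and input are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10))
x = torch.randn(1, 32, requires_grad=True)  # stand-in for an input image
score = model(x)[0].max()                   # logit of the predicted class
score.backward()                            # gradient of that logit w.r.t. the input
saliency = x.grad.abs().squeeze()           # per-feature importance map

The paper's point is precisely about how such maps should be defined and evaluated, so this is only the kind of object being discussed.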

Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction

no code implementations 14 Jun 2022 Kaifeng Lyu, Zhiyuan Li, Sanjeev Arora

Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets.
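As a reminder of what such a layer computes, Batch Normalization applies the standard transform

\mathrm{BN}(x) = \gamma \cdot \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta,

with mini-batch mean \mu_B, variance \sigma_B^2, and learned scale/shift \gamma, \beta; Layer Normalization uses per-example statistics instead. (Standard definitions, stated here only for context.)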

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

1 code implementation 20 May 2022 Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD.
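The approximation in question is usually written as the Itô SDE (standard form from this line of work; \Sigma is the covariance of the per-step gradient noise and \eta the learning rate)

dX_t = -\nabla L(X_t)\, dt + \sqrt{\eta}\, \Sigma(X_t)^{1/2}\, dW_t,

so one SGD step with learning rate \eta corresponds to advancing the continuous trajectory by time \eta.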

Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias

no code implementations NeurIPS 2021 Kaifeng Lyu, Zhiyuan Li, Runzhe Wang, Sanjeev Arora

The current paper establishes this global optimality for two-layer Leaky ReLU nets trained with gradient flow on linearly separable and symmetric data, regardless of the width.

New Definitions and Evaluations for Saliency Methods: Staying Intrinsic and Sound

no code implementations 29 Sep 2021 Arushi Gupta, Nikunj Saunshi, Dingli Yu, Kaifeng Lyu, Sanjeev Arora

Saliency methods seek to provide human-interpretable explanations for the output of a machine learning model on a given input.

Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning

no code implementations ICLR 2021 Zhiyuan Li, Yuping Luo, Kaifeng Lyu

Matrix factorization is a simple and natural test-bed to investigate the implicit regularization of gradient descent.
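Concretely, the test-bed is usually the matrix sensing/factorization objective (standard setup in this line of work; notation is illustrative)

L(U) = \frac{1}{2}\sum_{i=1}^{m} \left(\langle A_i, U U^\top \rangle - y_i\right)^2,

where gradient descent from a small initialization tends to fit solutions of gradually increasing rank, which is the behavior the paper formalizes as greedy low-rank learning.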

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

no code implementations NeurIPS 2020 Zhiyuan Li, Kaifeng Lyu, Sanjeev Arora

Recent works (e.g., Li and Arora, 2020) suggest that the use of popular normalization schemes (including Batch Normalization) in today's deep learning can move it far from a traditional optimization viewpoint, e.g., use of exponentially increasing learning rates.

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

1 code implementation ICLR 2020 Kaifeng Lyu, Jian Li

In this paper, we study the implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations.
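The implicit regularization is usually stated via the normalized margin of an L-homogeneous network f(\theta; x), i.e., one satisfying f(c\theta; x) = c^L f(\theta; x) for c > 0 (standard notation, paraphrasing the setting):

\gamma(\theta) = \min_i \frac{y_i f(\theta; x_i)}{\|\theta\|_2^L},

and the result is that, once the training loss is small, gradient descent/flow on exponential-type losses keeps increasing a smoothed version of this margin and drives \theta/\|\theta\| toward KKT points of the associated max-margin problem.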

Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

no code implementations ICLR 2019 Sanjeev Arora, Zhiyuan Li, Kaifeng Lyu

Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization.
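The "auto rate-tuning" effect rests on the scale-invariance BN induces: for weights \theta feeding into a BN layer the loss satisfies L(c\theta) = L(\theta) for all c > 0, and differentiating gives

\nabla L(c\theta) = \frac{1}{c}\, \nabla L(\theta),

so growing weight norm automatically shrinks the gradient, playing the role of a decaying learning rate (a standard observation underlying this analysis).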
