Search Results for author: Chulhee Yun

Found 24 papers, 2 papers with code

Fundamental Benefit of Alternating Updates in Minimax Optimization

no code implementations16 Feb 2024 Jaewook Lee, Hanseul Cho, Chulhee Yun

The Gradient Descent-Ascent (GDA) algorithm, designed to solve minimax optimization problems, takes the descent and ascent steps either simultaneously (Sim-GDA) or alternately (Alt-GDA).
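
For readers unfamiliar with the distinction, here is a minimal sketch of the two update schemes on a toy bilinear objective f(x, y) = x*y; the objective and step size are illustrative choices, not the paper's setting.

```python
def grad_f(x, y):
    # Toy bilinear objective f(x, y) = x * y: df/dx = y, df/dy = x.
    return y, x

def sim_gda(x, y, lr=0.1, steps=100):
    # Simultaneous GDA: both players update from the same iterate.
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - lr * gx, y + lr * gy
    return x, y

def alt_gda(x, y, lr=0.1, steps=100):
    # Alternating GDA: the ascent step sees the freshly updated x.
    for _ in range(steps):
        gx, _ = grad_f(x, y)
        x = x - lr * gx
        _, gy = grad_f(x, y)
        y = y + lr * gy
    return x, y

print(sim_gda(1.0, 1.0))  # spirals away from the saddle point (0, 0)
print(alt_gda(1.0, 1.0))  # stays bounded on this bilinear toy problem
```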

Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study

no code implementations25 Nov 2023 Prin Phunyaphibarn, Junghyun Lee, Bohan Wang, Huishuai Zhang, Chulhee Yun

Although gradient descent with momentum is widely used in modern deep learning, a concrete understanding of its effects on the training trajectory remains elusive.
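
As background for the setting studied above, a minimal sketch of gradient descent with heavy-ball momentum and a linear learning-rate warmup on a simple quadratic; the loss, step sizes, and warmup length are illustrative choices, not the paper's configuration.

```python
import numpy as np

def heavy_ball_with_warmup(w0, grad, lr=0.1, beta=0.9, warmup=50, steps=200):
    """Gradient descent with heavy-ball momentum and linear LR warmup."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        lr_t = lr * min(1.0, t / warmup)  # linear warmup, then constant LR
        v = beta * v + grad(w)            # momentum buffer
        w = w - lr_t * v
    return w

# Example: quadratic loss 0.5 * ||w||^2, whose gradient is w itself.
print(heavy_ball_with_warmup([1.0, -2.0], grad=lambda w: w))  # ~ the origin
```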

Fair Streaming Principal Component Analysis: Statistical and Algorithmic Viewpoint

1 code implementation NeurIPS 2023 Junghyun Lee, Hanseul Cho, Se-Young Yun, Chulhee Yun

Fair Principal Component Analysis (PCA) is a problem setting where we aim to perform PCA while making the resulting representation fair in that the projected distributions, conditional on the sensitive attributes, match one another.
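
To make the fairness criterion above concrete, here is a small NumPy sketch that compares group-conditional statistics of a PCA projection. It only checks the first two moments of the projected data and is not the paper's streaming algorithm; the data and sensitive attribute are synthetic.

```python
import numpy as np

def pca_projection(X, k):
    # Top-k principal directions of the centered data matrix X.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                          # shape (d, k)

def group_gap(X, s, V):
    # Proxy for unfairness: mismatch between group-conditional means and
    # covariances of the projected representation X @ V.
    Z0, Z1 = X[s == 0] @ V, X[s == 1] @ V
    mean_gap = np.linalg.norm(Z0.mean(axis=0) - Z1.mean(axis=0))
    cov_gap = np.linalg.norm(np.cov(Z0.T) - np.cov(Z1.T))
    return mean_gap, cov_gap

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
s = rng.integers(0, 2, size=500)             # binary sensitive attribute
print(group_gap(X, s, pca_projection(X, k=3)))  # small gaps ~ "fair" projection
```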

Linear attention is (maybe) all you need (to understand transformer optimization)

1 code implementation2 Oct 2023 Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra

Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics.

Provable Benefit of Mixup for Finding Optimal Decision Boundaries

no code implementations1 Jun 2023 Junsoo Oh, Chulhee Yun

We investigate how pair-wise data augmentation techniques like Mixup affect the sample complexity of finding optimal decision boundaries in a binary linear classification problem.

Data Augmentation
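
For reference, a minimal NumPy sketch of the Mixup augmentation analyzed above: each training example is replaced by a convex combination of itself and a random partner, with a Beta-distributed mixing weight. The alpha value and toy data are illustrative choices.

```python
import numpy as np

def mixup_batch(X, y, alpha=1.0, rng=None):
    """Mixup: convex combinations of example pairs and of their labels."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    lam = rng.beta(alpha, alpha, size=(n, 1))  # mixing weights in [0, 1]
    perm = rng.permutation(n)                  # random partner for each example
    X_mix = lam * X + (1 - lam) * X[perm]
    y_mix = lam[:, 0] * y + (1 - lam[:, 0]) * y[perm]
    return X_mix, y_mix

# Toy binary linear classification data with labels in {0, 1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
y = (X[:, 0] > 0).astype(float)
X_mix, y_mix = mixup_batch(X, y, rng=rng)
print(X_mix.shape, y_mix[:3])                  # mixed inputs and soft labels
```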

Tighter Lower Bounds for Shuffling SGD: Random Permutations and Beyond

no code implementations13 Mar 2023 Jaeyoung Cha, Jaewook Lee, Chulhee Yun

We study convergence lower bounds of without-replacement stochastic gradient descent (SGD) for solving smooth (strongly-)convex finite-sum minimization problems.
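
As a reminder of the algorithm family studied here, a minimal sketch of without-replacement SGD with random reshuffling on a finite-sum least-squares problem; the problem instance and step size are illustrative.

```python
import numpy as np

def shuffling_sgd(grads, w0, lr=0.01, epochs=20, rng=None):
    """Without-replacement SGD: each epoch visits every component gradient
    exactly once, in a freshly drawn random order (random reshuffling)."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w0, dtype=float)
    n = len(grads)
    for _ in range(epochs):
        for i in rng.permutation(n):           # new permutation every epoch
            w = w - lr * grads[i](w)
    return w

# Finite-sum least squares: f_i(w) = 0.5 * (a_i^T w - b_i)^2.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
grads = [lambda w, a=A[i], bi=b[i]: (a @ w - bi) * a for i in range(50)]
w = shuffling_sgd(grads, np.zeros(5), rng=rng)
print(np.linalg.norm(A @ w - b))               # residual approaches the LS fit
```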

On the Training Instability of Shuffling SGD with Batch Normalization

no code implementations24 Feb 2023 David X. Wu, Chulhee Yun, Suvrit Sra

We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence.

Regression

SGDA with shuffling: faster convergence for nonconvex-PŁ minimax optimization

no code implementations12 Oct 2022 Hanseul Cho, Chulhee Yun

Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems.

Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond

no code implementations ICLR 2022 Chulhee Yun, Shashank Rajput, Suvrit Sra

In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods.
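
To illustrate the contrast drawn above, a minimal sketch of the two methods on a noisy quadratic; the number of workers, local steps, and noise model are illustrative choices, not the paper's setting.

```python
import numpy as np

def minibatch_sgd(grad, w, workers=4, lr=0.1, rounds=50, rng=None):
    """Minibatch SGD: every round, average the workers' stochastic gradients."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(rounds):
        g = np.mean([grad(w, rng) for _ in range(workers)], axis=0)
        w = w - lr * g
    return w

def local_sgd(grad, w, workers=4, local_steps=5, lr=0.1, rounds=10, rng=None):
    """Local SGD (federated averaging): each worker takes several local steps
    from the shared iterate, then the local iterates are averaged."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(rounds):
        local_iterates = []
        for _ in range(workers):
            v = w.copy()
            for _ in range(local_steps):
                v = v - lr * grad(v, rng)
            local_iterates.append(v)
        w = np.mean(local_iterates, axis=0)
    return w

# Noisy quadratic: true gradient is w, plus Gaussian noise per stochastic query.
grad = lambda w, rng: w + 0.1 * rng.normal(size=w.shape)
print(minibatch_sgd(grad, np.ones(3)), local_sgd(grad, np.ones(3)))
```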

Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?

no code implementations12 Mar 2021 Chulhee Yun, Suvrit Sra, Ali Jadbabaie

We propose matrix norm inequalities that extend the Recht-Ré (2012) conjecture on a noncommutative AM-GM inequality, supplementing it with another inequality that accounts for single-shuffle: a widely used without-replacement sampling scheme that shuffles only once at the beginning and is overlooked in the Recht-Ré conjecture.
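
The sampling schemes compared in this line of work differ only in how the per-epoch visiting order is generated; a tiny sketch of that difference (the optimization loop itself is the same as in the shuffling-SGD sketch above):

```python
import numpy as np

def index_stream(n, epochs, scheme, rng):
    """Yield, per epoch, the order in which the n components are visited."""
    fixed = rng.permutation(n)             # single-shuffle: drawn once, reused
    for _ in range(epochs):
        if scheme == "with_replacement":   # vanilla SGD sampling
            yield rng.integers(0, n, size=n)
        elif scheme == "single_shuffle":   # shuffle only once at the beginning
            yield fixed
        elif scheme == "reshuffle":        # fresh permutation every epoch
            yield rng.permutation(n)

rng = np.random.default_rng(0)
for scheme in ("with_replacement", "single_shuffle", "reshuffle"):
    print(scheme, [order.tolist() for order in index_stream(5, 2, scheme, rng)])
```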

Provable Memorization via Deep Neural Networks using Sub-linear Parameters

no code implementations26 Oct 2020 Sejun Park, Jaeho Lee, Chulhee Yun, Jinwoo Shin

It is known that $O(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs.

Memorization

A Unifying View on Implicit Bias in Training Linear Neural Networks

no code implementations ICLR 2021 Chulhee Yun, Shankar Krishnan, Hossein Mobahi

For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a "transformed" input space defined by the network.

Tensor Networks
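
For orientation, the $\ell_{2/L}$ max-margin problem referenced above has the standard form below, written over separable data in the transformed input space (a sketch of the formulation, with $\tilde{x}_i$ denoting the transformed input; not a restatement of the paper's theorem).

```latex
% \ell_{2/L} max-margin problem over separable data \{(\tilde{x}_i, y_i)\}_{i=1}^n,
% where \|\cdot\|_{2/L} is the \ell_{2/L} quasi-norm and L is the network depth.
\begin{aligned}
\min_{w} \quad & \|w\|_{2/L} \\
\text{s.t.} \quad & y_i \, \langle w, \tilde{x}_i \rangle \ge 1, \qquad i = 1, \dots, n.
\end{aligned}
```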

Minimum Width for Universal Approximation

no code implementations ICLR 2021 Sejun Park, Chulhee Yun, Jaeho Lee, Jinwoo Shin

In this work, we provide the first definitive result in this direction for networks using the ReLU activation functions: The minimum width required for the universal approximation of the $L^p$ functions is exactly $\max\{d_x+1, d_y\}$.

$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

no code implementations NeurIPS 2020 Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function.

Low-Rank Bottleneck in Multi-head Attention Models

no code implementations ICML 2020 Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

The attention-based Transformer architecture has enabled significant advances in natural language processing.

Are Transformers universal approximators of sequence-to-sequence functions?

no code implementations ICLR 2020 Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

In this paper, we establish that Transformer models are universal approximators of continuous permutation-equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of parameter sharing in these models.

Concise Multi-head Attention Models

no code implementations25 Sep 2019 Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

The attention-based Transformer architecture has enabled significant advances in natural language processing.

Are deep ResNets provably better than linear predictors?

no code implementations NeurIPS 2019 Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor.

Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

no code implementations NeurIPS 2019 Chulhee Yun, Suvrit Sra, Ali Jadbabaie

We also prove that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, establishing tight bounds on memorization capacity.

Memorization

Efficiently testing local optimality and escaping saddles for ReLU networks

no code implementations ICLR 2019 Chulhee Yun, Suvrit Sra, Ali Jadbabaie

In the benign case, we solve one equality-constrained QP, and we prove that projected gradient descent solves it exponentially fast.
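
As an illustration of the subroutine mentioned above, a minimal sketch of projected gradient descent on an equality-constrained QP, minimizing 0.5 x^T Q x + c^T x subject to Ax = b by projecting every iterate back onto the affine feasible set; the problem instance is illustrative, not the paper's construction.

```python
import numpy as np

def project_affine(x, A, b):
    # Euclidean projection onto the affine set {x : A x = b}.
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def projected_gd_qp(Q, c, A, b, steps=500):
    """Projected gradient descent for min 0.5 x'Qx + c'x  s.t.  Ax = b."""
    lr = 1.0 / np.linalg.norm(Q, 2)                # step size from curvature bound
    x = project_affine(np.zeros(len(c)), A, b)     # feasible starting point
    for _ in range(steps):
        x = project_affine(x - lr * (Q @ x + c), A, b)
    return x

# Small convex instance: Q positive definite, a single sum constraint.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
Q, c = M @ M.T + np.eye(4), rng.normal(size=4)
A, b = np.ones((1, 4)), np.array([1.0])            # constraint: sum(x) = 1
x = projected_gd_qp(Q, c, A, b)
print(x, A @ x)                                    # the iterate satisfies Ax = b
```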

Small nonlinearities in activation functions create bad local minima in neural networks

no code implementations ICLR 2019 Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Our results thus indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust.

Global optimality conditions for deep neural networks

no code implementations ICLR 2018 Chulhee Yun, Suvrit Sra, Ali Jadbabaie

We study the error landscape of deep linear and nonlinear neural networks with the squared error loss.
