no code implementations • 16 Feb 2024 • Jaewook Lee, Hanseul Cho, Chulhee Yun
The Gradient Descent-Ascent (GDA) algorithm, designed to solve minimax optimization problems, takes the descent and ascent steps either simultaneously (Sim-GDA) or alternately (Alt-GDA).
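Below is a minimal sketch of the two update orders on a toy bilinear objective f(x, y) = x·y; the objective, step size, and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Toy bilinear objective f(x, y) = x * y, so df/dx = y and df/dy = x.
def grad_x(x, y): return y
def grad_y(x, y): return x

def sim_gda(x, y, eta=0.1, steps=100):
    """Simultaneous GDA: both updates read the same iterate (x_t, y_t)."""
    for _ in range(steps):
        x, y = x - eta * grad_x(x, y), y + eta * grad_y(x, y)
    return x, y

def alt_gda(x, y, eta=0.1, steps=100):
    """Alternating GDA: the ascent step sees the freshly updated x."""
    for _ in range(steps):
        x = x - eta * grad_x(x, y)
        y = y + eta * grad_y(x, y)
    return x, y

print(sim_gda(1.0, 1.0))  # slowly diverges on this bilinear problem
print(alt_gda(1.0, 1.0))  # stays bounded, illustrating the Sim/Alt gap
```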
no code implementations • 25 Nov 2023 • Prin Phunyaphibarn, Junghyun Lee, Bohan Wang, Huishuai Zhang, Chulhee Yun
Although gradient descent with momentum is widely used in modern deep learning, a concrete understanding of its effects on the training trajectory remains elusive.
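For reference, a minimal sketch of the standard heavy-ball momentum update referred to above; the step size, momentum coefficient, and test objective are illustrative.

```python
import numpy as np

def gd_with_momentum(grad, x0, lr=0.01, beta=0.9, steps=1000):
    """Heavy-ball momentum: v <- beta * v + grad(x), then x <- x - lr * v."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad(x)
        x = x - lr * v
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is simply x.
print(gd_with_momentum(lambda x: x, np.ones(3)))  # converges to ~0
```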
1 code implementation • NeurIPS 2023 • Junghyun Lee, Hanseul Cho, Se-Young Yun, Chulhee Yun
Fair Principal Component Analysis (PCA) is a problem setting where we aim to perform PCA while making the resulting representation fair in that the projected distributions, conditional on the sensitive attributes, match one another.
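As an illustration of the constraint, the sketch below enforces only the first moment of that condition, deflating the group mean-difference direction before running ordinary PCA; this simplification and all names are hypothetical, not the paper's algorithm.

```python
import numpy as np

def mean_fair_pca(X, groups, k):
    """PCA whose k components are orthogonal to the group mean-difference
    direction, so the two projected group means coincide (first-moment
    fairness only, an illustrative simplification).

    X: (n, d) data; groups: boolean array of length n (two groups); k < d.
    """
    X = X - X.mean(axis=0)
    delta = X[groups].mean(axis=0) - X[~groups].mean(axis=0)
    delta /= np.linalg.norm(delta)
    X_perp = X - np.outer(X @ delta, delta)  # remove the delta component
    _, _, Vt = np.linalg.svd(X_perp, full_matrices=False)
    return Vt[:k].T  # (d, k) projection matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
g = rng.random(200) < 0.5
X[g] += 2.0  # group-dependent mean shift
V = mean_fair_pca(X, g, k=2)
Z = (X - X.mean(axis=0)) @ V
print(Z[g].mean(axis=0), Z[~g].mean(axis=0))  # approximately equal
```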
1 code implementation • 2 Oct 2023 • Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra
Transformer training is notoriously difficult, requiring careful optimizer design and the use of various heuristics.
no code implementations • 1 Jun 2023 • Junsoo Oh, Chulhee Yun
We investigate how pair-wise data augmentation techniques like Mixup affect the sample complexity of finding optimal decision boundaries in a binary linear classification problem.
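A minimal sketch of the Mixup operation itself, assuming the usual Beta-distributed mixing weight; alpha and the toy inputs are illustrative.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Mixup: a convex combination of a pair of examples and their labels,
    with mixing weight lam ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_mix, y_mix = mixup(np.array([1.0, 0.0]), 1.0, np.array([0.0, 1.0]), 0.0)
print(x_mix, y_mix)  # a point on the segment between the two examples
```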
no code implementations • 13 Mar 2023 • Jaeyoung Cha, Jaewook Lee, Chulhee Yun
We study convergence lower bounds of without-replacement stochastic gradient descent (SGD) for solving smooth (strongly-)convex finite-sum minimization problems.
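A minimal sketch of without-replacement SGD under random reshuffling, one common variant covered by such bounds: each epoch visits every component exactly once in a fresh random order. The step size and toy objective are illustrative.

```python
import numpy as np

def shuffled_sgd(grads, x0, lr=0.1, epochs=10, rng=None):
    """Without-replacement SGD (random reshuffling): each epoch draws a
    fresh permutation and uses every component gradient exactly once.

    grads: per-component gradient functions for f(x) = (1/n) sum_i f_i(x).
    """
    rng = rng or np.random.default_rng(0)
    x = float(x0)
    for _ in range(epochs):
        for i in rng.permutation(len(grads)):  # without-replacement order
            x -= lr * grads[i](x)
    return x

# f_i(x) = 0.5 * (x - a_i)^2, so the minimizer of the average is mean(a).
a = np.array([0.0, 1.0, 2.0, 3.0])
grads = [lambda x, ai=ai: x - ai for ai in a]
print(shuffled_sgd(grads, x0=10.0))  # approaches mean(a) = 1.5
```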
no code implementations • 24 Feb 2023 • David X. Wu, Chulhee Yun, Suvrit Sra
We uncover how SGD interacts with batch normalization, and how this interaction can produce undesirable training dynamics such as divergence.
no code implementations • 12 Oct 2022 • Hanseul Cho, Chulhee Yun
Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems.
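A minimal sketch of simultaneous SGDA on a finite-sum minimax objective, sampling one component per step; the objective and step size are illustrative assumptions.

```python
import numpy as np

def sgda(grad_x_i, grad_y_i, n, x, y, lr=0.05, steps=2000, rng=None):
    """Stochastic GDA for min_x max_y (1/n) sum_i f_i(x, y): sample a
    component i, then descend in x and ascend in y simultaneously."""
    rng = rng or np.random.default_rng(0)
    for _ in range(steps):
        i = rng.integers(n)
        gx, gy = grad_x_i(i, x, y), grad_y_i(i, x, y)
        x, y = x - lr * gx, y + lr * gy
    return x, y

# f_i(x, y) = 0.5*(x - a_i)^2 - 0.5*(y - a_i)^2, saddle at (mean a, mean a).
a = np.array([0.0, 2.0])
print(sgda(lambda i, x, y: x - a[i],
           lambda i, x, y: -(y - a[i]),
           n=2, x=5.0, y=-5.0))  # approaches the saddle point near (1, 1)
```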
no code implementations • ICLR 2022 • Chulhee Yun, Shashank Rajput, Suvrit Sra
In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods.
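A minimal sketch contrasting the two communication patterns: local SGD runs several local steps per worker before averaging iterates, while minibatch SGD averages all workers' gradients at every step. The toy objectives and hyperparameters are illustrative.

```python
import numpy as np

def local_sgd(worker_grads, x0, lr=0.1, rounds=20, local_steps=5):
    """Local SGD / federated averaging: each worker takes local_steps
    gradient steps from the shared iterate, then iterates are averaged."""
    x = float(x0)
    for _ in range(rounds):
        xs = []
        for g in worker_grads:
            xw = x
            for _ in range(local_steps):
                xw -= lr * g(xw)
            xs.append(xw)
        x = float(np.mean(xs))  # communication round: average local iterates
    return x

def minibatch_sgd(worker_grads, x0, lr=0.1, rounds=20, local_steps=5):
    """Minibatch SGD baseline: every step averages all workers' gradients."""
    x = float(x0)
    for _ in range(rounds * local_steps):  # same total gradient budget
        x -= lr * np.mean([g(x) for g in worker_grads])
    return x

# Two workers with heterogeneous objectives f_w(x) = 0.5 * (x - a_w)^2.
workers = [lambda x: x - 0.0, lambda x: x - 4.0]
print(local_sgd(workers, 10.0), minibatch_sgd(workers, 10.0))  # both -> 2.0
```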
no code implementations • 12 Mar 2021 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
We propose matrix norm inequalities that extend the Recht-Ré (2012) conjecture on a noncommutative AM-GM inequality, supplementing it with another inequality that accounts for single-shuffle, a widely used without-replacement sampling scheme that shuffles the data only once at the beginning and is overlooked by the Recht-Ré conjecture.
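The two without-replacement schemes mentioned above differ only in how the epoch orders are drawn; a minimal sketch of that difference, with hypothetical names:

```python
import numpy as np

def epoch_orders(n, epochs, scheme, rng=None):
    """Component orders for without-replacement SGD.

    'single-shuffle': one permutation drawn up front, reused every epoch.
    'random-reshuffling': a fresh permutation drawn per epoch.
    """
    rng = rng or np.random.default_rng(0)
    if scheme == "single-shuffle":
        perm = rng.permutation(n)
        return [perm for _ in range(epochs)]
    return [rng.permutation(n) for _ in range(epochs)]

print(epoch_orders(4, 3, "single-shuffle"))      # same order repeated
print(epoch_orders(4, 3, "random-reshuffling"))  # fresh order each epoch
```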
no code implementations • NeurIPS 2020 • Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function.
no code implementations • 26 Oct 2020 • Sejun Park, Jaeho Lee, Chulhee Yun, Jinwoo Shin
It is known that $O(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs.
no code implementations • ICLR 2021 • Chulhee Yun, Shankar Krishnan, Hossein Mobahi
For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a "transformed" input space defined by the network.
no code implementations • ICLR 2021 • Sejun Park, Chulhee Yun, Jaeho Lee, Jinwoo Shin
In this work, we provide the first definitive result in this direction for networks using the ReLU activation function: The minimum width required for the universal approximation of the $L^p$ functions is exactly $\max\{d_x+1, d_y\}$.
no code implementations • NeurIPS 2020 • Kwangjun Ahn, Chulhee Yun, Suvrit Sra
We study without-replacement SGD for solving finite-sum optimization problems.
no code implementations • ICML 2020 • Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
The attention-based Transformer architecture has enabled significant advances in the field of natural language processing.
no code implementations • ICLR 2020 • Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of parameter sharing in these models.
no code implementations • NeurIPS 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor.
no code implementations • NeurIPS 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
We also prove that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, proving tight bounds on memorization capacity.
no code implementations • ICLR 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
In the benign case, we solve one equality-constrained QP, and we prove that projected gradient descent solves it exponentially fast.
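A minimal sketch of projected gradient descent on a generic equality-constrained QP, min 0.5·xᵀQx − bᵀx subject to Ax = c, where the projection onto the affine feasible set has a closed form; the instance below is illustrative, not the paper's construction.

```python
import numpy as np

def projected_gd(Q, b, A, c, lr=0.1, steps=500):
    """Projected gradient descent for min 0.5*x^T Q x - b^T x  s.t.  Ax = c.
    Each step follows the gradient, then projects back onto {x : Ax = c}."""
    AAt_inv = np.linalg.inv(A @ A.T)
    def project(z):  # affine projection: z - A^T (A A^T)^{-1} (A z - c)
        return z - A.T @ (AAt_inv @ (A @ z - c))
    x = project(np.zeros(Q.shape[0]))  # feasible starting point
    for _ in range(steps):
        x = project(x - lr * (Q @ x - b))
    return x

Q = np.diag([1.0, 2.0, 3.0])
b = np.ones(3)
A = np.array([[1.0, 1.0, 1.0]])
c = np.array([1.0])
x = projected_gd(Q, b, A, c)
print(x, A @ x)  # the iterate satisfies the constraint (entries sum to 1)
```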
no code implementations • ICLR 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
Our results thus indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust.
no code implementations • ICLR 2018 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
We study the error landscape of deep linear and nonlinear neural networks with the squared error loss.