no code implementations • 15 Feb 2023 • Angeliki Giannou, Shashank Rajput, Dimitris Papailiopoulos
Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks.
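For context, a minimal NumPy sketch of the two transforms named here (illustrative, not the paper's code): Batch-Normalization standardizes each feature across the batch, Layer-Normalization standardizes each sample across its features, and in both cases the learnable scale and shift (gamma, beta) are the normalization layer's trainable parameters.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension (rows = samples)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample over its feature dimension."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(32, 16)              # batch of 32 samples, 16 features
gamma, beta = np.ones(16), np.zeros(16)  # learnable scale/shift parameters
print(batch_norm(x, gamma, beta).shape, layer_norm(x, gamma, beta).shape)
```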
1 code implementation • 30 Jan 2023 • Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos
We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop.
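As a rough illustration of the looping idea (a toy sketch, not the paper's construction): a single attention-plus-MLP block with fixed, hand-set weights is applied to the same set of tokens repeatedly, so the number of loop iterations, rather than network depth, plays the role of computation steps. The dimensions and random weights below are placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(X, Wq, Wk, Wv, W1, W2):
    """One attention + MLP block with fixed ("programmed") weights."""
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(X.shape[1]))
    X = X + A @ (X @ Wv)                  # attention with residual
    X = X + np.maximum(X @ W1, 0.0) @ W2  # ReLU MLP with residual
    return X

d, n_tokens, n_steps = 8, 4, 10
rng = np.random.default_rng(0)
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(5)]
X = rng.standard_normal((n_tokens, d))    # scratchpad/"memory" tokens

# The loop: the same fixed-weight block is reused, so iterations, not depth,
# determine how much computation is performed.
for _ in range(n_steps):
    X = transformer_block(X, *weights)
print(X.shape)
```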
1 code implementation • 14 Jun 2022 • Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee
LIFT does not make any changes to the model architecture or loss function, and it solely relies on the natural language interface, enabling "no-code machine learning with LMs."
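A minimal sketch of the language-interface idea, with hypothetical column names and template wording: a tabular example is serialized into a plain-text prompt/completion pair, which can then be passed to an off-the-shelf LM fine-tuning pipeline unchanged.

```python
# Hypothetical column names and template; the point is only that features and
# labels are serialized into plain text so a language model can be fine-tuned
# through its usual text interface, with no architecture or loss changes.
def to_prompt(features: dict, label=None):
    desc = ", ".join(f"{name} is {value}" for name, value in features.items())
    prompt = f"Given that {desc}, what is the species?"
    return {"prompt": prompt, "completion": str(label)} if label is not None else prompt

example = to_prompt(
    {"sepal length": 5.1, "sepal width": 3.5, "petal length": 1.4, "petal width": 0.2},
    label="setosa",
)
print(example["prompt"])
```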
1 code implementation • 23 May 2022 • Tuan Dinh, Jy-yong Sohn, Shashank Rajput, Timothy Ossowski, Yifei Ming, Junjie Hu, Dimitris Papailiopoulos, Kangwook Lee
Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods.
no code implementations • ICLR 2022 • Chulhee Yun, Shashank Rajput, Suvrit Sra
In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods.
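For reference, a toy NumPy sketch of the two baselines on a least-squares objective: minibatch SGD averages the workers' gradients at every step, while local SGD lets each worker take several local steps before the iterates are averaged. Step sizes, batch sizes, and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5)); b = rng.standard_normal(200)
grad = lambda w, idx: A[idx].T @ (A[idx] @ w - b[idx]) / len(idx)

def minibatch_sgd(workers=4, rounds=50, lr=0.01):
    w = np.zeros(5)
    for _ in range(rounds):
        idxs = [rng.choice(200, 10) for _ in range(workers)]
        w -= lr * np.mean([grad(w, i) for i in idxs], axis=0)  # average gradients every step
    return w

def local_sgd(workers=4, rounds=50, local_steps=5, lr=0.01):
    w = np.zeros(5)
    for _ in range(rounds):
        local_iterates = []
        for _ in range(workers):
            v = w.copy()
            for _ in range(local_steps):            # each worker runs SGD locally...
                v -= lr * grad(v, rng.choice(200, 10))
            local_iterates.append(v)
        w = np.mean(local_iterates, axis=0)         # ...then iterates are averaged
    return w

print(minibatch_sgd()[:3], local_sgd()[:3])
```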
no code implementations • 18 Oct 2021 • Kartik Sreenivasan, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos
A recent work by Ramanujan et al. (2020) provides significant empirical evidence that sufficiently overparameterized, random neural networks contain untrained subnetworks that achieve state-of-the-art accuracy on several predictive tasks.
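A short sketch of the object in question (illustrative only): an "untrained subnetwork" is a binary mask applied to frozen random weights; the empirical methods (e.g., edge-popup) search over masks rather than weights, a step omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random two-layer network whose weights are never trained.
W1, W2 = rng.standard_normal((64, 16)), rng.standard_normal((16, 10))

# A "subnetwork" is just a binary mask over those frozen weights; mask-search
# procedures optimize the mask, not the weights (not shown here).
m1, m2 = (rng.random(W1.shape) < 0.5), (rng.random(W2.shape) < 0.5)

def subnetwork_forward(x):
    h = np.maximum(x @ (W1 * m1), 0.0)
    return h @ (W2 * m2)

print(subnetwork_forward(rng.standard_normal((8, 64))).shape)  # (8, 10)
```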
no code implementations • NeurIPS 2021 • Shashank Rajput, Kartik Sreenivasan, Dimitris Papailiopoulos, Amin Karbasi
Recently, Vershynin (2020) settled a long-standing question by Baum (1988), proving that \emph{deep threshold} networks can memorize $n$ points in $d$ dimensions using $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(e^{1/\delta^2}(d+\sqrt{n})+n)$ weights, where $\delta$ is the minimum distance between the points.
1 code implementation • ICLR 2022 • Shashank Rajput, Kangwook Lee, Dimitris Papailiopoulos
However, for general strongly convex functions, random permutations are optimal.
1 code implementation • NeurIPS 2020 • Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitris Papailiopoulos
We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep.
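A tiny illustration of the subset-sum principle behind this kind of bound (not the paper's construction): a single target weight is approximated by keeping, i.e., not pruning, a suitable subset of random weights; the width overhead comes from needing enough random candidates per target weight.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Core of the subset-sum view of pruning: a single target weight can be
# approximated by keeping (not pruning) a suitable subset of random weights.
target = 0.73
candidates = rng.uniform(-1, 1, size=12)   # random weights available for pruning

best, best_err = None, np.inf
for r in range(1, len(candidates) + 1):
    for subset in combinations(range(len(candidates)), r):
        err = abs(candidates[list(subset)].sum() - target)
        if err < best_err:
            best, best_err = subset, err

print(f"kept {len(best)} of {len(candidates)} random weights, error {best_err:.4f}")
```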
2 code implementations • NeurIPS 2020 • Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, Dimitris Papailiopoulos
Due to its decentralized nature, Federated Learning (FL) lends itself to adversarial attacks in the form of backdoors during training.
1 code implementation • 14 Jun 2020 • Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitris Papailiopoulos
We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep.
no code implementations • ICML 2020 • Shashank Rajput, Anant Gupta, Dimitris Papailiopoulos
A recent line of breakthrough works on SGD without replacement (SGDo) established an $\mathcal{O}\left(\frac{n}{T^2}\right)$ convergence rate when the function minimized is strongly convex and is a sum of $n$ smooth functions, and an $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^3}{T^3}\right)$ rate for sums of quadratics.
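For concreteness, a toy sketch contrasting the two sampling schemes (illustrative, not from the paper): with-replacement SGD draws a component i.i.d. at each step, while SGD without replacement reshuffles the $n$ components every epoch and visits each exactly once.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)
grad_i = lambda w, i: A[i] * (A[i] @ w - b[i])   # gradient of the i-th component

def sgd_with_replacement(epochs=20, lr=0.01):
    w = np.zeros(d)
    for _ in range(epochs * n):
        w -= lr * grad_i(w, rng.integers(n))      # sample a component i.i.d.
    return w

def sgd_without_replacement(epochs=20, lr=0.01):
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):              # fresh random permutation each epoch
            w -= lr * grad_i(w, i)                # each component used exactly once
    return w

print(sgd_with_replacement()[:3], sgd_without_replacement()[:3])
```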
1 code implementation • NeurIPS 2019 • Shashank Rajput, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos
In this work, we present DETOX, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation.
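A toy sketch of the two ingredients named here, with coordinate-wise medians standing in for DETOX's actual voting and aggregation rules: gradients are computed redundantly by small groups of workers, each group is filtered by a vote, and the group outputs are then aggregated robustly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_workers, group_size = 10, 12, 3
true_grad = rng.standard_normal(d)

# Honest workers report the gradient plus noise; Byzantine workers report garbage.
byzantine = set(rng.choice(n_workers, 2, replace=False))
reports = [true_grad + 0.01 * rng.standard_normal(d) if w not in byzantine
           else 100.0 * rng.standard_normal(d) for w in range(n_workers)]

# Stage 1 (algorithmic redundancy): workers are split into groups that compute
# the *same* gradient, and each group is filtered by a coordinate-wise vote.
groups = [reports[i:i + group_size] for i in range(0, n_workers, group_size)]
group_votes = [np.median(g, axis=0) for g in groups]

# Stage 2 (robust aggregation): aggregate the group outputs robustly as well.
aggregate = np.median(group_votes, axis=0)
print(np.linalg.norm(aggregate - true_grad))
```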
no code implementations • 22 May 2019 • Zachary Charles, Shashank Rajput, Stephen Wright, Dimitris Papailiopoulos
Our results are derived by showing that adversarial training with gradient updates minimizes a robust version of the empirical risk at a $\mathcal{O}(\ln(t)^2/t)$ rate, despite non-smoothness.
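A minimal sketch of adversarial training with gradient updates in a linear, $\ell_\infty$-perturbation setting (an assumption chosen for illustration, not the paper's exact setup): for a linear model the worst-case perturbation has a closed form, shifting each margin by $\epsilon\|w\|_1$, and the resulting robust loss is non-smooth in $w$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 200, 5, 0.1
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)                      # linearly separable labels

def robust_logistic_grad(w, X, y, eps):
    """(Sub)gradient of the worst-case l_inf logistic loss for a linear model:
    the inner maximization reduces to margin -> margin - eps * ||w||_1."""
    z = y * (X @ w) - eps * np.linalg.norm(w, 1)
    s = 1.0 / (1.0 + np.exp(z))              # = sigmoid(-z)
    return (-s[:, None] * (y[:, None] * X - eps * np.sign(w))).mean(axis=0)

w = np.zeros(d)
for t in range(2000):
    w -= 0.1 * robust_logistic_grad(w, X, y, eps)

print("robust margin:", (y * (X @ w) - eps * np.linalg.norm(w, 1)).min())
```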
no code implementations • 8 May 2019 • Shashank Rajput, Zhili Feng, Zachary Charles, Po-Ling Loh, Dimitris Papailiopoulos
Data augmentation (DA) is commonly used during model training, as it significantly reduces test error and improves model robustness.