no code implementations • 30 May 2023 • Yingcong Li, Kartik Sreenivasan, Angeliki Giannou, Dimitris Papailiopoulos, Samet Oymak
These findings collectively provide insights into the mechanics of CoT, inviting further investigation of its role in complex reasoning tasks.
1 code implementation • 8 May 2023 • Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, Kangwook Lee
In this paper, we propose MPC (Modular Prompted Chatbot), a new approach for creating high-quality conversational agents without the need for fine-tuning.
1 code implementation • 4 May 2023 • Hongyi Wang, Saurabh Agarwal, Pongsakorn U-chupala, Yoshiki Tanaka, Eric P. Xing, Dimitris Papailiopoulos
Cuttlefish leverages the observation that after a few epochs of full-rank training, the stable rank (i.e., an approximation of the true rank) of each layer stabilizes at a constant value.
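For concreteness, a minimal numpy sketch of the stable-rank quantity this entry refers to; the function name and the random test matrix are illustrative assumptions, not code from Cuttlefish:

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2: squared Frobenius norm over squared spectral norm."""
    sv = np.linalg.svd(W, compute_uv=False)   # singular values, in descending order
    return float(np.sum(sv ** 2) / sv[0] ** 2)

# Tracking this per layer during training shows when it flattens out,
# which is roughly the stabilization the entry describes.
W = np.random.randn(256, 512)  # stand-in for a layer's weight matrix
print(stable_rank(W))
```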
no code implementations • 15 Feb 2023 • Angeliki Giannou, Shashank Rajput, Dimitris Papailiopoulos
Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks.
1 code implementation • 30 Jan 2023 • Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos
We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop.
1 code implementation • 17 Jan 2023 • Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, Samet Oymak
We first explore the statistical aspects of this abstraction through the lens of multitask learning: We obtain generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d. (input, label) pairs or (2) a trajectory arising from a dynamical system.
no code implementations • 6 Oct 2022 • Liu Yang, Jifan Zhang, Joseph Shenouda, Dimitris Papailiopoulos, Kangwook Lee, Robert D. Nowak
For neural networks with ReLU activations, solutions to the weight decay objective are equivalent to those of a different objective in which the regularization term is instead a sum of products of $\ell_2$ (not squared) norms of the input and output weights associated with each ReLU.
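As a hedged illustration for a two-layer ReLU network: because ReLU is positively homogeneous, each unit's input and output weights can be rescaled without changing the network's function, and at the balanced scale $\tfrac{1}{2}(\|w_j\|^2 + v_j^2) = \|w_j\|\,|v_j|$ by AM-GM, which is why the two regularizers share minimizers. The numpy sketch below simply evaluates both terms; the network shape and variable names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))  # input weights; row j feeds ReLU unit j
v = rng.standard_normal(16)       # output weights; entry j reads ReLU unit j

# Standard weight decay: half the sum of squared l2 norms of all weights
weight_decay = 0.5 * (np.sum(W ** 2) + np.sum(v ** 2))

# Equivalent regularizer from the entry: per-unit products of unsquared norms
product_reg = np.sum(np.linalg.norm(W, axis=1) * np.abs(v))

print(weight_decay, product_reg)  # equal only after rebalancing the per-unit scales
```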
1 code implementation • 14 Jun 2022 • Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee
LIFT does not make any changes to the model architecture or loss function, and it solely relies on the natural language interface, enabling "no-code machine learning with LMs."
1 code implementation • 23 May 2022 • Tuan Dinh, Jy-yong Sohn, Shashank Rajput, Timothy Ossowski, Yifei Ming, Junjie Hu, Dimitris Papailiopoulos, Kangwook Lee
Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods.
1 code implementation • 24 Feb 2022 • Kartik Sreenivasan, Jy-yong Sohn, Liu Yang, Matthew Grinde, Alliot Nagle, Hongyi Wang, Eric Xing, Kangwook Lee, Dimitris Papailiopoulos
Frankle & Carbin conjecture that we can avoid this by training "lottery tickets", i.e., special sparse subnetworks found at initialization that can be trained to high accuracy.
no code implementations • 7 Jan 2022 • Jy-yong Sohn, Liang Shang, Hongxu Chen, Jaekyun Moon, Dimitris Papailiopoulos, Kangwook Lee
Mixup is a data augmentation method that generates new data points by mixing a pair of input data.
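For reference, a minimal sketch of the standard mixup recipe; the Beta-distributed mixing coefficient and the one-hot label convention follow the original mixup formulation and are assumptions here, not necessarily this paper's variant:

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Return a new training point as a convex combination of two examples.
    x1, x2 are inputs; y1, y2 are one-hot (or soft) label vectors."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing weight in [0, 1]
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```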
no code implementations • 18 Oct 2021 • Kartik Sreenivasan, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos
A recent work by Ramanujan et al. (2020) provides significant empirical evidence that sufficiently overparameterized, random neural networks contain untrained subnetworks that achieve state-of-the-art accuracy on several predictive tasks.
no code implementations • NeurIPS 2021 • Shashank Rajput, Kartik Sreenivasan, Dimitris Papailiopoulos, Amin Karbasi
Recently, Vershynin (2020) settled a long-standing question by Baum (1988), proving that \emph{deep threshold} networks can memorize $n$ points in $d$ dimensions using $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(e^{1/\delta^2}(d+\sqrt{n})+n)$ weights, where $\delta$ is the minimum distance between the points.
1 code implementation • 5 Mar 2021 • Hongyi Wang, Saurabh Agarwal, Dimitris Papailiopoulos
In this work, we present Pufferfish, a communication- and computation-efficient distributed training framework that incorporates gradient compression into the model training process by training low-rank, pre-factorized deep networks.
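A minimal numpy sketch of the low-rank, pre-factorized idea: a layer is stored and trained as two thin factors, so the parameters (and hence the gradients exchanged between workers) are already compressed. The class name, shapes, and initialization below are illustrative assumptions, not Pufferfish's implementation:

```python
import numpy as np

class LowRankLinear:
    """A d_out x d_in layer parameterized as U @ V with U: d_out x r, V: r x d_in.
    Only r * (d_in + d_out) parameters are trained and communicated."""

    def __init__(self, d_in, d_out, rank, rng=None):
        rng = rng or np.random.default_rng(0)
        self.U = rng.standard_normal((d_out, rank)) / np.sqrt(rank)
        self.V = rng.standard_normal((rank, d_in)) / np.sqrt(d_in)

    def forward(self, x):
        # x: shape (d_in,) or (d_in, batch); the full d_out x d_in matrix is never materialized
        return self.U @ (self.V @ x)
```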
1 code implementation • 28 Feb 2021 • Saurabh Agarwal, Hongyi Wang, Shivaram Venkataraman, Dimitris Papailiopoulos
A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training.
1 code implementation • ICLR 2022 • Shashank Rajput, Kangwook Lee, Dimitris Papailiopoulos
However, for general strongly convex functions, random permutations are optimal.
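For reference, a minimal sketch of permutation-based SGD, where each epoch processes the $n$ component functions in a freshly sampled random order; `grad_fn(i, x)` returning the gradient of the $i$-th component is an assumed interface:

```python
import numpy as np

def sgd_random_reshuffling(grad_fn, x0, n, epochs, lr):
    """Without-replacement SGD: draw a fresh random permutation of the n components
    each epoch and take one gradient step per component in that order."""
    x = np.array(x0, dtype=float)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = x - lr * grad_fn(i, x)
    return x
```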
1 code implementation • NeurIPS 2020 • Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitris Papailiopoulos
We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep.
2 code implementations • 29 Oct 2020 • Saurabh Agarwal, Hongyi Wang, Kangwook Lee, Shivaram Venkataraman, Dimitris Papailiopoulos
The techniques usually require choosing a static compression ratio, often requiring users to balance the trade-off between model accuracy and per-iteration speedup.
2 code implementations • NeurIPS 2020 • Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, Dimitris Papailiopoulos
Due to its decentralized nature, Federated Learning (FL) lends itself to adversarial attacks in the form of backdoors during training.
1 code implementation • 14 Jun 2020 • Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitris Papailiopoulos
We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep.
no code implementations • ICML 2020 • Shashank Rajput, Anant Gupta, Dimitris Papailiopoulos
A recent line of breakthrough works on SGD without replacement (SGDo) established an $\mathcal{O}\left(\frac{n}{T^2}\right)$ convergence rate when the function minimized is strongly convex and is a sum of $n$ smooth functions, and an $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^3}{T^3}\right)$ rate for sums of quadratics.
1 code implementation • ICLR 2020 • Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, Yasaman Khazaeni
Federated learning allows edge devices to collaboratively learn a shared model while keeping the training data on device, decoupling the ability to do model training from the need to store the data in the cloud.
1 code implementation • NeurIPS 2019 • Shashank Rajput, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos
In this work, we present DETOX, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation.
1 code implementation • NeurIPS 2020 • Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas
Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD).
no code implementations • 22 May 2019 • Zachary Charles, Shashank Rajput, Stephen Wright, Dimitris Papailiopoulos
Our results are derived by showing that adversarial training with gradient updates minimizes a robust version of the empirical risk at a $\mathcal{O}(\ln(t)^2/t)$ rate, despite non-smoothness.
no code implementations • 8 May 2019 • Shashank Rajput, Zhili Feng, Zachary Charles, Po-Ling Loh, Dimitris Papailiopoulos
Data augmentation (DA) is commonly used during model training, as it significantly reduces test error and improves model robustness.
no code implementations • 29 Mar 2019 • Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, Furong Huang, Martin Jaggi, Kevin Jamieson, Michael I. Jordan, Gauri Joshi, Rania Khalaf, Jason Knight, Jakub Konečný, Tim Kraska, Arun Kumar, Anastasios Kyrillidis, Aparna Lakshmiratan, Jing Li, Samuel Madden, H. Brendan McMahan, Erik Meijer, Ioannis Mitliagkas, Rajat Monga, Derek Murray, Kunle Olukotun, Dimitris Papailiopoulos, Gennady Pekhimenko, Theodoros Rekatsinas, Afshin Rostamizadeh, Christopher Ré, Christopher De Sa, Hanie Sedghi, Siddhartha Sen, Virginia Smith, Alex Smola, Dawn Song, Evan Sparks, Ion Stoica, Vivienne Sze, Madeleine Udell, Joaquin Vanschoren, Shivaram Venkataraman, Rashmi Vinayak, Markus Weimer, Andrew Gordon Wilson, Eric Xing, Matei Zaharia, Ce Zhang, Ameet Talwalkar
Machine learning (ML) techniques are enjoying rapidly increasing adoption.
1 code implementation • 28 Jan 2019 • Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos
We present ErasureHead, a new approach for distributed gradient descent (GD) that mitigates system delays by employing approximate gradient coding.
no code implementations • 8 Nov 2018 • Zachary Charles, Harrison Rosenberg, Dimitris Papailiopoulos
We show that these "transferable adversarial directions" are guaranteed to exist for linear separators of a given set, and will exist with high probability for linear classifiers trained on independent sets drawn from the same distribution.
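To make the linear case concrete, a hedged sketch of a single direction that perturbs every point toward a linear separator's decision boundary; the $\{-1,+1\}$ label convention and the perturbation budget `eps` are assumptions for illustration:

```python
import numpy as np

def shared_adversarial_direction(w):
    """For any linear classifier sign(w @ x + b), the unit vector along w is one
    direction that moves every input straight toward the decision boundary."""
    return w / np.linalg.norm(w)

def perturb(X, y, w, eps):
    """Push each row of X against its own label by eps along the shared direction."""
    d = shared_adversarial_direction(w)
    return X - eps * y[:, None] * d  # y in {-1, +1}; X has shape (num_points, dim)
```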
1 code implementation • NeurIPS 2018 • Hongyi Wang, Scott Sievert, Zachary Charles, Shengchao Liu, Stephen Wright, Dimitris Papailiopoulos
We present ATOMO, a general framework for atomic sparsification of stochastic gradients.
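A hedged sketch of the core idea of unbiased atomic sparsification: given the coefficients of a gradient in some atomic decomposition (its entries, or the singular values of its SVD), keep each coefficient with probability proportional to its magnitude and rescale the survivors so the sparsified gradient is unbiased. ATOMO's actual probability assignment enforces the sparsity budget more carefully; the version below is a simplified illustration:

```python
import numpy as np

def sparsify_atoms(coeffs, budget, rng=None):
    """Keep coefficient i with probability p_i proportional to |coeffs_i| (capped at 1)
    and rescale by 1 / p_i, so E[output] = coeffs while most entries become zero."""
    rng = rng or np.random.default_rng(0)
    p = np.minimum(1.0, budget * np.abs(coeffs) / np.sum(np.abs(coeffs)))
    keep = rng.random(coeffs.shape) < p
    out = np.zeros_like(coeffs, dtype=float)
    out[keep] = coeffs[keep] / p[keep]
    return out
```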
no code implementations • NeurIPS 2018 • Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, Paraschos Koutris
Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training.
no code implementations • 25 May 2018 • Zachary Charles, Dimitris Papailiopoulos
Gradient descent and its many variants, including mini-batch stochastic gradient descent, form the algorithmic foundation of modern large-scale machine learning.
1 code implementation • ICML 2018 • Lingjiao Chen, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos
Distributed model training is vulnerable to byzantine system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS).
no code implementations • 17 Nov 2017 • Zachary Charles, Dimitris Papailiopoulos, Jordan Ellenberg
Distributed algorithms are often beset by the straggler effect, where the slowest compute nodes in the system dictate the overall running time.
no code implementations • ICML 2018 • Zachary Charles, Dimitris Papailiopoulos
Finally, we show that although our results imply comparable stability for SGD and GD in the PL setting, there exist simple neural networks with multiple local minima where SGD is stable but GD is not.
no code implementations • 18 Jun 2017 • Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, Peter Bartlett
It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch-size.
1 code implementation • NeurIPS 2016 • Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, Ce Zhang, Michael I. Jordan, Kannan Ramchandran, Chris Re, Benjamin Recht
We present CYCLADES, a general framework for parallelizing stochastic optimization algorithms in a shared memory setting.
no code implementations • 9 Mar 2016 • Megasthenis Asteris, Anastasios Kyrillidis, Dimitris Papailiopoulos, Alexandros G. Dimakis
We present a novel approximation algorithm for $k$-BCC, a variant of BCC with an upper bound $k$ on the number of clusters.
no code implementations • 8 Dec 2015 • Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, Kannan Ramchandran
We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling.
no code implementations • NeurIPS 2015 • Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis
Our algorithm relies on a novel approximation to the related Nonnegative Principal Component Analysis (NNPCA) problem; given an arbitrary data matrix, NNPCA seeks $k$ nonnegative components that jointly capture most of the variance.
no code implementations • NeurIPS 2015 • Megasthenis Asteris, Dimitris Papailiopoulos, Anastasios Kyrillidis, Alexandros G. Dimakis
We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with disjoint supports that jointly capture the maximum possible variance.
no code implementations • 24 Jul 2015 • Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan
We demonstrate experimentally on a 16-core machine that the sparse and parallel version of SVRG is in some cases more than four orders of magnitude faster than the standard SVRG algorithm.
no code implementations • 21 Jul 2015 • Siu On Chan, Dimitris Papailiopoulos, Aviad Rubinstein
It is well known that Sparse PCA (Sparse Principal Component Analysis) is NP-hard to solve exactly on worst-case instances.
no code implementations • NeurIPS 2015 • Xinghao Pan, Dimitris Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan
We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds and achieve nearly linear speedups, provably.
no code implementations • 6 Apr 2014 • Dimitris Papailiopoulos, Anastasios Kyrillidis, Christos Boutsidis
We explain theoretically a curious empirical phenomenon: "Approximating a matrix by deterministically selecting a subset of its columns with the corresponding largest leverage scores results in a good low-rank matrix surrogate".
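A minimal numpy sketch of the deterministic leverage-score column selection described above; the target rank `k`, the number of kept columns `c`, and the function name are illustrative assumptions:

```python
import numpy as np

def top_leverage_columns(A, k, c):
    """Keep the c columns of A with the largest rank-k leverage scores, i.e. the
    squared norms of the columns of the top-k right singular subspace of A."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.sum(Vt[:k, :] ** 2, axis=0)   # leverage score of each column of A
    cols = np.argsort(scores)[::-1][:c]       # indices of the c largest scores
    return A[:, cols], cols
```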