Search Results for author: Dimitris Papailiopoulos

Found 52 papers, 26 papers with code

CHAI: Clustered Head Attention for Efficient LLM Inference

1 code implementation12 Mar 2024 Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu

We observe that there is a high amount of redundancy across heads on which tokens they pay attention to.

How Well Can Transformers Emulate In-context Newton's Method?

no code implementations5 Mar 2024 Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, Jason D. Lee

Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second order ones for the case of linear regression.

In-Context Learning regression

Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks

2 code implementations6 Feb 2024 Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos

State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, by incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention.

In-Context Learning Language Modelling +1

Looped Transformers are Better at Learning Learning Algorithms

1 code implementation21 Nov 2023 Liu Yang, Kangwook Lee, Robert Nowak, Dimitris Papailiopoulos

Transformers have demonstrated effectiveness in in-context solving data-fitting problems from various (latent) models, as reported by Garg et al.

Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

no code implementations12 Jul 2023 Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee

This paper presents "Predictive Pipelined Decoding (PPD)," an approach that speeds up greedy decoding in Large Language Models (LLMs) while maintaining the exact same output as the original decoding.

Teaching Arithmetic to Small Transformers

1 code implementation7 Jul 2023 Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed.

Low-Rank Matrix Completion

Prompted LLMs as Chatbot Modules for Long Open-domain Conversation

1 code implementation8 May 2023 Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, Kangwook Lee

In this paper, we propose MPC (Modular Prompted Chatbot), a new approach for creating high-quality conversational agents without the need for fine-tuning.

Chatbot

Cuttlefish: Low-Rank Model Training without All the Tuning

1 code implementation4 May 2023 Hongyi Wang, Saurabh Agarwal, Pongsakorn U-chupala, Yoshiki Tanaka, Eric P. Xing, Dimitris Papailiopoulos

Cuttlefish leverages the observation that after a few epochs of full-rank training, the stable rank (i. e., an approximation of the true rank) of each layer stabilizes at a constant value.

The Expressive Power of Tuning Only the Normalization Layers

no code implementations15 Feb 2023 Angeliki Giannou, Shashank Rajput, Dimitris Papailiopoulos

Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks.

Looped Transformers as Programmable Computers

1 code implementation30 Jan 2023 Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos

We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop.

In-Context Learning

Transformers as Algorithms: Generalization and Stability in In-context Learning

2 code implementations17 Jan 2023 Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, Samet Oymak

We first explore the statistical aspects of this abstraction through the lens of multitask learning: We obtain generalization bounds for ICL when the input prompt is (1) a sequence of i. i. d.

Generalization Bounds In-Context Learning +3

PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized Deep Neural Networks

no code implementations6 Oct 2022 Liu Yang, Jifan Zhang, Joseph Shenouda, Dimitris Papailiopoulos, Kangwook Lee, Robert D. Nowak

Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness.

LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks

1 code implementation14 Jun 2022 Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee

LIFT does not make any changes to the model architecture or loss function, and it solely relies on the natural language interface, enabling "no-code machine learning with LMs."

BIG-bench Machine Learning General Classification +2

Rare Gems: Finding Lottery Tickets at Initialization

1 code implementation24 Feb 2022 Kartik Sreenivasan, Jy-yong Sohn, Liu Yang, Matthew Grinde, Alliot Nagle, Hongyi Wang, Eric Xing, Kangwook Lee, Dimitris Papailiopoulos

Frankle & Carbin conjecture that we can avoid this by training "lottery tickets", i. e., special sparse subnetworks found at initialization, that can be trained to high accuracy.

Finding Everything within Random Binary Networks

no code implementations18 Oct 2021 Kartik Sreenivasan, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos

A recent work by Ramanujan et al. (2020) provides significant empirical evidence that sufficiently overparameterized, random neural networks contain untrained subnetworks that achieve state-of-the-art accuracy on several predictive tasks.

An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks

no code implementations NeurIPS 2021 Shashank Rajput, Kartik Sreenivasan, Dimitris Papailiopoulos, Amin Karbasi

Recently, Vershynin (2020) settled a long standing question by Baum (1988), proving that \emph{deep threshold} networks can memorize $n$ points in $d$ dimensions using $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(e^{1/\delta^2}(d+\sqrt{n})+n)$ weights, where $\delta$ is the minimum distance between the points.

Memorization

Pufferfish: Communication-efficient Models At No Extra Cost

1 code implementation5 Mar 2021 Hongyi Wang, Saurabh Agarwal, Dimitris Papailiopoulos

In this work, we present Pufferfish, a communication and computation efficient distributed training framework that incorporates the gradient compression into the model training process via training low-rank, pre-factorized deep networks.

Quantization

On the Utility of Gradient Compression in Distributed Training Systems

1 code implementation28 Feb 2021 Saurabh Agarwal, Hongyi Wang, Shivaram Venkataraman, Dimitris Papailiopoulos

A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training.

Model Compression

Permutation-Based SGD: Is Random Optimal?

1 code implementation ICLR 2022 Shashank Rajput, Kangwook Lee, Dimitris Papailiopoulos

However, for general strongly convex functions, random permutations are optimal.

Optimal Lottery Tickets via Subset Sum: Logarithmic Over-Parameterization is Sufficient

1 code implementation NeurIPS 2020 Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitris Papailiopoulos

We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(log(dl))$ wider and twice as deep.

Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification

3 code implementations29 Oct 2020 Saurabh Agarwal, Hongyi Wang, Kangwook Lee, Shivaram Venkataraman, Dimitris Papailiopoulos

The techniques usually require choosing a static compression ratio, often requiring users to balance the trade-off between model accuracy and per-iteration speedup.

Quantization

Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient

1 code implementation14 Jun 2020 Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitris Papailiopoulos

We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep.

Closing the convergence gap of SGD without replacement

no code implementations ICML 2020 Shashank Rajput, Anant Gupta, Dimitris Papailiopoulos

A recent line of breakthrough works on SGD without replacement (SGDo) established an $\mathcal{O}\left(\frac{n}{T^2}\right)$ convergence rate when the function minimized is strongly convex and is a sum of $n$ smooth functions, and an $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^3}{T^3}\right)$ rate for sums of quadratics.

Federated Learning with Matched Averaging

1 code implementation ICLR 2020 Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, Yasaman Khazaeni

Federated learning allows edge devices to collaboratively learn a shared model while keeping the training data on device, decoupling the ability to do model training from the need to store the data in the cloud.

Federated Learning

DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation

1 code implementation NeurIPS 2019 Shashank Rajput, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

In this work, we present DETOX, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation.

Bad Global Minima Exist and SGD Can Reach Them

1 code implementation NeurIPS 2020 Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas

Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD).

Data Augmentation Image Classification

Convergence and Margin of Adversarial Training on Separable Data

no code implementations22 May 2019 Zachary Charles, Shashank Rajput, Stephen Wright, Dimitris Papailiopoulos

Our results are derived by showing that adversarial training with gradient updates minimizes a robust version of the empirical risk at a $\mathcal{O}(\ln(t)^2/t)$ rate, despite non-smoothness.

Does Data Augmentation Lead to Positive Margin?

no code implementations8 May 2019 Shashank Rajput, Zhili Feng, Zachary Charles, Po-Ling Loh, Dimitris Papailiopoulos

Data augmentation (DA) is commonly used during model training, as it significantly improves test error and model robustness.

Data Augmentation

ErasureHead: Distributed Gradient Descent without Delays Using Approximate Gradient Coding

1 code implementation28 Jan 2019 Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

We present ErasureHead, a new approach for distributed gradient descent (GD) that mitigates system delays by employing approximate gradient coding.

A Geometric Perspective on the Transferability of Adversarial Directions

no code implementations8 Nov 2018 Zachary Charles, Harrison Rosenberg, Dimitris Papailiopoulos

We show that these "transferable adversarial directions" are guaranteed to exist for linear separators of a given set, and will exist with high probability for linear classifiers trained on independent sets drawn from the same distribution.

The Effect of Network Width on the Performance of Large-batch Training

no code implementations NeurIPS 2018 Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, Paraschos Koutris

Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training.

Gradient Coding via the Stochastic Block Model

no code implementations25 May 2018 Zachary Charles, Dimitris Papailiopoulos

Gradient descent and its many variants, including mini-batch stochastic gradient descent, form the algorithmic foundation of modern large-scale machine learning.

Stochastic Block Model

DRACO: Byzantine-resilient Distributed Training via Redundant Gradients

1 code implementation ICML 2018 Lingjiao Chen, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

Distributed model training is vulnerable to byzantine system failures and adversarial compute nodes, i. e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS).

Approximate Gradient Coding via Sparse Random Graphs

no code implementations17 Nov 2017 Zachary Charles, Dimitris Papailiopoulos, Jordan Ellenberg

Distributed algorithms are often beset by the straggler effect, where the slowest compute nodes in the system dictate the overall running time.

Stability and Generalization of Learning Algorithms that Converge to Global Optima

no code implementations ICML 2018 Zachary Charles, Dimitris Papailiopoulos

Finally, we show that although our results imply comparable stability for SGD and GD in the PL setting, there exist simple neural networks with multiple local minima where SGD is stable but GD is not.

Generalization Bounds

Gradient Diversity: a Key Ingredient for Scalable Distributed Learning

no code implementations18 Jun 2017 Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, Peter Bartlett

It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch-size.

Diversity Quantization

Bipartite Correlation Clustering -- Maximizing Agreements

no code implementations9 Mar 2016 Megasthenis Asteris, Anastasios Kyrillidis, Dimitris Papailiopoulos, Alexandros G. Dimakis

We present a novel approximation algorithm for $k$-BCC, a variant of BCC with an upper bound $k$ on the number of clusters.

Clustering

Speeding Up Distributed Machine Learning Using Codes

no code implementations8 Dec 2015 Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, Kannan Ramchandran

We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling.

BIG-bench Machine Learning

Orthogonal NMF through Subspace Exploration

no code implementations NeurIPS 2015 Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis

Our algorithm relies on a novel approximation to the related Nonnegative Principal Component Analysis (NNPCA) problem; given an arbitrary data matrix, NNPCA seeks $k$ nonnegative components that jointly capture most of the variance.

Clustering

Sparse PCA via Bipartite Matchings

no code implementations NeurIPS 2015 Megasthenis Asteris, Dimitris Papailiopoulos, Anastasios Kyrillidis, Alexandros G. Dimakis

We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with disjoint supports that jointly capture the maximum possible variance.

Perturbed Iterate Analysis for Asynchronous Stochastic Optimization

no code implementations24 Jul 2015 Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, Michael. I. Jordan

We demonstrate experimentally on a 16-core machine that the sparse and parallel version of SVRG is in some cases more than four orders of magnitude faster than the standard SVRG algorithm.

Stochastic Optimization

On the Worst-Case Approximability of Sparse PCA

no code implementations21 Jul 2015 Siu On Chan, Dimitris Papailiopoulos, Aviad Rubinstein

It is well known that Sparse PCA (Sparse Principal Component Analysis) is NP-hard to solve exactly on worst-case instances.

Parallel Correlation Clustering on Big Graphs

no code implementations NeurIPS 2015 Xinghao Pan, Dimitris Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, Michael. I. Jordan

We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds and achieve nearly linear speedups, provably.

Clustering

Provable Deterministic Leverage Score Sampling

no code implementations6 Apr 2014 Dimitris Papailiopoulos, Anastasios Kyrillidis, Christos Boutsidis

We explain theoretically a curious empirical phenomenon: "Approximating a matrix by deterministically selecting a subset of its columns with the corresponding largest leverage scores results in a good low-rank matrix surrogate".

Cannot find the paper you are looking for? You can Submit a new open access paper.