1 code implementation • 18 Jul 2024 • Irene Wang, Jakub Tarnawski, Amar Phanishayee, Divya Mahajan
Distributed execution of deep learning training involves a dynamic interplay between hardware accelerator architecture and device placement strategy.
1 code implementation • 18 Jul 2024 • Seonho Lee, Amar Phanishayee, Divya Mahajan
To address these questions, we introduce NeuSight, a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution.
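The excerpt does not describe NeuSight's prediction mechanism; purely as an illustration of estimating kernel latency from published GPU specifications, here is a minimal roofline-style sketch (the function, kernel shape, and hardware numbers are hypothetical placeholders, not NeuSight's model):

```python
# Minimal roofline-style latency estimate for a single GEMM kernel on a
# hypothetical GPU, using only published peak compute and memory bandwidth.
# Illustrative sketch only; this is NOT the actual NeuSight predictor.

def gemm_latency_ms(m, n, k, peak_tflops, mem_bw_gbps, bytes_per_elem=2):
    flops = 2 * m * n * k                               # multiply-adds in the GEMM
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    memory_ms = bytes_moved / (mem_bw_gbps * 1e9) * 1e3
    return max(compute_ms, memory_ms)                   # bound by the slower resource

# Example: a 4096x4096x4096 GEMM on a GPU with 312 TFLOPS and 1555 GB/s
# (spec-sheet-style numbers used purely as placeholders).
print(f"{gemm_latency_ms(4096, 4096, 4096, 312, 1555):.2f} ms")
```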
no code implementations • 30 Nov 2023 • Ankit Bhardwaj, Amar Phanishayee, Deepak Narayanan, Mihail Tarta, Ryan Stutsman
We present Packrat, a new serving system for online inference that, given a model and batch size ($B$), algorithmically picks the optimal number of instances ($i$), the number of threads allocated to each ($t$), and the batch size each should operate on ($b$) so as to minimize latency.
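The abstract frames this as a small discrete search; below is a minimal sketch, assuming a hypothetical predict_latency cost model (not Packrat's actual latency model) and a fixed core budget, of how such a configuration search could be enumerated:

```python
# Brute-force search over (instances i, threads-per-instance t, per-instance
# batch b) for a fixed total batch size B and core budget.
# predict_latency() is a hypothetical cost model, not Packrat's actual one.

def pick_config(model, B, total_cores, predict_latency):
    best = None
    for i in range(1, B + 1):
        if B % i:                          # each instance gets an equal share of B
            continue
        b = B // i
        for t in range(1, total_cores // i + 1):
            lat = predict_latency(model, i, t, b)
            if best is None or lat < best[0]:
                best = (lat, i, t, b)
    return best                            # (latency, instances, threads, batch)

def toy_latency(model, i, t, b):           # hypothetical cost model for demo
    return b / t + 0.1 * i

print(pick_config("resnet50", B=16, total_cores=16, predict_latency=toy_latency))
```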
no code implementations • 15 Dec 2022 • Jack Kosaian, Amar Phanishayee
Achieving high GPU utilization is critical to increasing application-level throughput and ensuring a good return on investment for deploying GPUs.
1 code implementation • 2 Feb 2022 • Youjie Li, Amar Phanishayee, Derek Murray, Jakub Tarnawski, Nam Sung Kim
Deep neural networks (DNNs) have grown exponentially in size over the past decade, leaving only those who have massive datacenter-based resources with the ability to develop and train such models.
no code implementations • NeurIPS 2021 • Jakub M. Tarnawski, Deepak Narayanan, Amar Phanishayee
The rapid increase in sizes of state-of-the-art DNN models, and consequently the increase in the compute and memory requirements of model training, has led to the development of many execution schemes such as data parallelism, pipeline model parallelism, tensor (intra-layer) model parallelism, and various memory-saving optimizations.
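To make "tensor (intra-layer) model parallelism" concrete, here is a minimal NumPy sketch (illustrative only, not from the paper) that splits one linear layer column-wise across two simulated devices:

```python
import numpy as np

# Tensor (intra-layer) parallelism in miniature: one linear layer's weight
# is split column-wise across two "devices"; each computes a slice of the
# output, which is then concatenated. Purely illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 512))          # batch of activations
W = rng.standard_normal((512, 1024))       # full weight matrix

W0, W1 = np.split(W, 2, axis=1)            # each shard lives on one device
y0, y1 = x @ W0, x @ W1                    # computed independently
y = np.concatenate([y0, y1], axis=1)       # gather the output slices

assert np.allclose(y, x @ W)               # matches the unsharded layer
```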
no code implementations • 12 Oct 2021 • Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, Vijay Chidambaram
Unfortunately, existing GPU cluster schedulers do not consider the impact of a job's sensitivity to the allocation of CPU, memory, and storage resources.
3 code implementations • 9 Apr 2021 • Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick Legresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia
In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters.
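As a small illustration (the numbers are hypothetical, not the paper's configuration) of how the three parallelism degrees compose, every GPU can be indexed by a (tensor, pipeline, data) coordinate, and the degrees multiply into the total GPU count:

```python
# The three parallelism degrees compose multiplicatively: each GPU is
# identified by a (tensor, pipeline, data) coordinate. Illustrative only.
tensor_parallel   = 8      # GPUs sharing each layer's tensors (within a node)
pipeline_parallel = 12     # pipeline stages, each holding a slice of layers
data_parallel     = 32     # replicas processing different micro-batches

total_gpus = tensor_parallel * pipeline_parallel * data_parallel
print(total_gpus)          # 3072 GPUs for this hypothetical configuration
```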
no code implementations • 14 Jul 2020 • Jayashree Mohan, Amar Phanishayee, Ashish Raniwala, Vijay Chidambaram
We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, and GPU generation on servers that are part of a large production cluster at Microsoft.
1 code implementation • NeurIPS 2020 • Jakub Tarnawski, Amar Phanishayee, Nikhil R. Devanur, Divya Mahajan, Fanny Nina Paravecino
However, for such settings (large models and multiple heterogeneous devices), we require automated algorithms and toolchains that can partition the ML workload across devices.
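As a flavor of what automated partitioning looks like in the simplest case, here is a textbook dynamic program (an illustration only, not the paper's algorithm, which handles general DNN graphs and heterogeneous devices) that splits a linear chain of operator costs into k contiguous segments so as to minimize the most-loaded device:

```python
from functools import lru_cache

# Partition a linear chain of operator costs into k contiguous segments,
# minimizing the most-loaded device. A textbook DP, shown only to convey
# the flavor of automated placement over a simplified workload model.
def partition(costs, k):
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(None)
    def best(i, devices):                  # place ops i.. onto `devices` devices
        if devices == 1:
            return prefix[len(costs)] - prefix[i]
        return min(
            max(prefix[j] - prefix[i], best(j, devices - 1))
            for j in range(i + 1, len(costs) - devices + 2)
        )

    return best(0, k)

print(partition([4, 2, 6, 3, 5, 1], 3))    # -> 9, e.g. segments [4,2][6,3][5,1]
```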
1 code implementation • 16 Jun 2020 • Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei Zaharia
Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models.
no code implementations • 5 Jun 2020 • Hongyu Zhu, Amar Phanishayee, Gennady Pekhimenko
Modern deep neural network (DNN) training jobs use complex and heterogeneous software/hardware stacks.
no code implementations • 11 Oct 2019 • Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil Devanur, Ion Stoica
Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale.
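For context on where that overhead comes from, below is a single-process simulation (illustrative only, not Blink's protocol) of the ring all-reduce that data-parallel training commonly uses to sum gradients across GPUs:

```python
import numpy as np

# Single-process simulation of ring all-reduce over n simulated GPUs:
# n-1 reduce-scatter steps followed by n-1 all-gather steps. Each step moves
# one gradient chunk per GPU around the ring, which is why communication cost
# grows with model size. Blink instead builds collectives tailored to
# heterogeneous GPU interconnects; this sketch shows only the baseline idea.
def ring_allreduce(grads):
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    for step in range(n - 1):                          # reduce-scatter
        for rank in range(n):
            idx = (rank - step) % n
            chunks[(rank + 1) % n][idx] += chunks[rank][idx]

    for step in range(n - 1):                          # all-gather
        for rank in range(n):
            idx = (rank + 1 - step) % n
            chunks[(rank + 1) % n][idx] = chunks[rank][idx].copy()

    return [np.concatenate(c) for c in chunks]         # every rank has the sum

grads = [np.full(8, i, dtype=float) for i in range(4)]     # 4 GPUs' gradients
out = ring_allreduce(grads)
assert all(np.allclose(o, sum(grads)) for o in out)        # 0+1+2+3 everywhere
```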
1 code implementation • ICML 2020 • Kevin Hsieh, Amar Phanishayee, Onur Mutlu, Phillip B. Gibbons
Our study shows that: (i) skewed data labels are a fundamental and pervasive problem for decentralized learning, causing significant accuracy loss across many ML applications, DNN models, training datasets, and decentralized learning algorithms; (ii) the problem is particularly challenging for DNN models with batch normalization; and (iii) the degree of data skew is a key determinant of the difficulty of the problem.
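To make "skewed data labels" concrete, here is a minimal sketch (an assumed toy setup, not the paper's experimental methodology) of a label-skewed partition in which each simulated node only ever sees a subset of classes:

```python
import numpy as np

# Build a label-skewed (non-IID) partition: the label space is split into
# disjoint shards and each simulated node only sees its own shard's classes.
# This is the kind of skew the study identifies as a fundamental difficulty
# for decentralized learning. Purely illustrative setup.
def label_skewed_partition(labels, nodes=5, seed=0):
    rng = np.random.default_rng(seed)
    classes = rng.permutation(np.unique(labels))
    shards = np.array_split(classes, nodes)            # disjoint class subsets
    return [np.flatnonzero(np.isin(labels, s)) for s in shards]

labels = np.repeat(np.arange(10), 100)                 # toy: 10 classes x 100
for node, idx in enumerate(label_skewed_partition(labels)):
    print(node, sorted(set(labels[idx].tolist())))     # each node sees 2 classes
```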
1 code implementation • 17 Jan 2019 • Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang
With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a range of products.
1 code implementation • 8 Jun 2018 • Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons
PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines.
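A minimal sketch of the pipelining intuition follows (illustrative only; PipeDream's actual 1F1B schedule, weight versioning, and automatic partitioning are more involved): micro-batches flow through stages so that, after a short ramp-up, all machines work concurrently.

```python
# Minimal simulation of pipelined forward execution: micro-batch m enters
# stage s at time step (m + s), so after a short ramp-up every stage works
# on a different micro-batch in parallel. Shows only the pipelining idea,
# not PipeDream's interleaved forward/backward (1F1B) scheduling.
stages, micro_batches = 4, 6
timeline = {}
for m in range(micro_batches):
    for s in range(stages):
        timeline.setdefault(m + s, []).append(f"stage{s}:mb{m}")

for t in sorted(timeline):
    print(f"t={t}: " + ", ".join(timeline[t]))
# At t=3 through t=5, all four stages are busy simultaneously.
```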
no code implementations • 21 May 2018 • Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy
Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud.
no code implementations • 16 Mar 2018 • Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Amar Phanishayee, Bianca Schroeder, Gennady Pekhimenko
Our primary goal in this work is to break this myopic view by (i) proposing a new benchmark for DNN training, called TBD (short for Training Benchmark for DNNs), that uses a representative set of DNN models covering a wide range of machine learning applications: image classification, machine translation, speech recognition, object detection, adversarial networks, and reinforcement learning; and (ii) performing an extensive performance analysis of training these different applications on three major deep learning frameworks (TensorFlow, MXNet, CNTK) across different hardware configurations (single-GPU, multi-GPU, and multi-machine).