Search Results for author: Amar Phanishayee

Found 16 papers, 7 papers with code

Packrat: Automatic Reconfiguration for Latency Minimization in CPU-based DNN Serving

no code implementations30 Nov 2023 Ankit Bhardwaj, Amar Phanishayee, Deepak Narayanan, Mihail Tarta, Ryan Stutsman

We present Packrat, a new serving system for online inference that given a model and batch size ($B$) algorithmically picks the optimal number of instances ($i$), the number of threads each should be allocated ($t$), and the batch sizes each should operate on ($b$) that minimizes latency.

A Study on the Intersection of GPU Utilization and CNN Inference

no code implementations15 Dec 2022 Jack Kosaian, Amar Phanishayee

Achieving high GPU utilization is critical to increasing application-level throughput and ensuring a good return on investment for deploying GPUs.

Neural Architecture Search

Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers

1 code implementation2 Feb 2022 Youjie Li, Amar Phanishayee, Derek Murray, Jakub Tarnawski, Nam Sung Kim

Deep neural networks (DNNs) have grown exponentially in size over the past decade, leaving only those who have massive datacenter-based resources with the ability to develop and train such models.

Piper: Multidimensional Planner for DNN Parallelization

no code implementations NeurIPS 2021 Jakub M. Tarnawski, Deepak Narayanan, Amar Phanishayee

The rapid increase in sizes of state-of-the-art DNN models, and consequently the increase in the compute and memory requirements of model training, has led to the development of many execution schemes such as data parallelism, pipeline model parallelism, tensor (intra-layer) model parallelism, and various memory-saving optimizations.

Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters

no code implementations12 Oct 2021 Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, Vijay Chidambaram

Unfortunately, these schedulers do not consider the impact of a job's sensitivity to allocation of CPU, memory, and storage resources.


Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

1 code implementation9 Apr 2021 Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick Legresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia

In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters.

Language Modelling

Analyzing and Mitigating Data Stalls in DNN Training

no code implementations14 Jul 2020 Jayashree Mohan, Amar Phanishayee, Ashish Raniwala, Vijay Chidambaram

We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, GPU generation etc on servers that are a part of a large production cluster at Microsoft.

Efficient Algorithms for Device Placement of DNN Graph Operators

1 code implementation NeurIPS 2020 Jakub Tarnawski, Amar Phanishayee, Nikhil R. Devanur, Divya Mahajan, Fanny Nina Paravecino

However, for such settings (large models and multiple heterogeneous devices), we require automated algorithms and toolchains that can partition the ML workload across devices.

Memory-Efficient Pipeline-Parallel DNN Training

1 code implementation16 Jun 2020 Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei Zaharia

Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models.

Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training

no code implementations5 Jun 2020 Hongyu Zhu, Amar Phanishayee, Gennady Pekhimenko

Modern deep neural network (DNN) training jobs use complex and heterogeneous software/hardware stacks.

Blink: Fast and Generic Collectives for Distributed ML

no code implementations11 Oct 2019 Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil Devanur, Ion Stoica

Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale.

Image Classification

The Non-IID Data Quagmire of Decentralized Machine Learning

1 code implementation ICML 2020 Kevin Hsieh, Amar Phanishayee, Onur Mutlu, Phillip B. Gibbons

Our study shows that: (i) skewed data labels are a fundamental and pervasive problem for decentralized learning, causing significant accuracy loss across many ML applications, DNN models, training datasets, and decentralized learning algorithms; (ii) the problem is particularly challenging for DNN models with batch normalization; and (iii) the degree of data skew is a key determinant of the difficulty of the problem.

BIG-bench Machine Learning

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

1 code implementation17 Jan 2019 Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang

With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products.

Distributed, Parallel, and Cluster Computing

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

1 code implementation8 Jun 2018 Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons

PipeDream is a Deep Neural Network(DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines.

Distributed, Parallel, and Cluster Computing

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

no code implementations21 May 2018 Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy

Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud.

TBD: Benchmarking and Analyzing Deep Neural Network Training

no code implementations16 Mar 2018 Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Amar Phanishayee, Bianca Schroeder, Gennady Pekhimenko

Our primary goal in this work is to break this myopic view by (i) proposing a new benchmark for DNN training, called TBD (TBD is short for Training Benchmark for DNNs), that uses a representative set of DNN models that cover a wide range of machine learning applications: image classification, machine translation, speech recognition, object detection, adversarial networks, reinforcement learning, and (ii) by performing an extensive performance analysis of training these different applications on three major deep learning frameworks (TensorFlow, MXNet, CNTK) across different hardware configurations (single-GPU, multi-GPU, and multi-machine).

Benchmarking General Classification +6

Cannot find the paper you are looking for? You can Submit a new open access paper.