Search Results for author: Srinivas Sridharan

Found 12 papers, 2 papers with code

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

3 code implementations • 24 Mar 2023 • William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna

In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms.
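
As a rough illustration of what a simulator like ASTRA-sim models in far more detail, the sketch below estimates ring all-reduce time on a two-tier (scale-up/scale-out) network. The topology, bandwidth numbers, and the simple per-tier ring formula are illustrative assumptions, not values or code from the paper.

```python
# Minimal analytical cost model for an all-reduce on a hierarchical network.
# Illustrative sketch only; not the ASTRA-sim implementation.

def ring_allreduce_time(msg_bytes, n_nodes, bw_bytes_per_s, link_latency_s=1e-6):
    """Classic ring all-reduce: 2*(n-1)/n of the data crosses each link."""
    if n_nodes <= 1:
        return 0.0
    traffic = 2.0 * (n_nodes - 1) / n_nodes * msg_bytes
    return 2 * (n_nodes - 1) * link_latency_s + traffic / bw_bytes_per_s

def hierarchical_allreduce_time(msg_bytes, tiers):
    """tiers: list of (nodes_in_tier, bandwidth_bytes_per_s), innermost first.
    Runs one ring phase per tier on the full message -- a crude upper bound
    on the reduce-scatter/all-gather decompositions real systems use."""
    return sum(ring_allreduce_time(msg_bytes, n, bw) for n, bw in tiers)

# Assumed example: 100 MB of gradients, 8 GPUs per node on a 300 GB/s
# scale-up fabric, 16 nodes on a 25 GB/s scale-out NIC.
t = hierarchical_allreduce_time(100e6, [(8, 300e9), (16, 25e9)])
print(f"estimated all-reduce time: {t * 1e3:.2f} ms")
```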

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks

no code implementations • 16 Dec 2022 • Mingyu Liang, Wenyin Fu, Louis Feng, Zhongyi Lin, Pavani Panakanti, Shengbao Zheng, Srinivas Sridharan, Christina Delimitrou

We evaluate our methodology on several production AI models and show that benchmarks generated with Mystique closely resemble the original AI models, both in execution time and in system-level metrics.

Impact of RoCE Congestion Control Policies on Distributed Training of DNNs

no code implementations • 22 Jul 2022 • Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, Tushar Krishna

Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads, motivating the need for an optimized yet low-overhead congestion control scheme tailored to the characteristics of distributed training platforms and workloads.


Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

no code implementations • 9 Oct 2021 • Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, Tushar Krishna

Distributed training is a solution for reducing DNN training time by splitting the task across multiple NPUs (e.g., GPUs/TPUs).
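
Themis is a network bandwidth-aware collective scheduling policy. As a rough illustration of the bandwidth-aware idea, the sketch below splits a collective's payload across heterogeneous network dimensions in proportion to their bandwidth so that the dimensions finish at roughly the same time; the dimension names and bandwidth numbers are assumptions for the example, not the paper's policy.

```python
# Bandwidth-proportional splitting of a collective across network dimensions.
# Illustrative sketch only; not the Themis scheduling policy itself.

def split_by_bandwidth(total_bytes, dims):
    """dims: {dimension_name: bandwidth_bytes_per_s} -> bytes per dimension."""
    total_bw = sum(dims.values())
    return {name: total_bytes * bw / total_bw for name, bw in dims.items()}

# Assumed example: 512 MB payload over a fast scale-up and a slower scale-out dimension.
shares = split_by_bandwidth(512e6, {"scale-up": 300e9, "scale-out": 25e9})
for dim, nbytes in shares.items():
    print(f"{dim}: {nbytes / 1e6:.1f} MB")
```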


Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

no code implementations • 20 Mar 2020 • Maxim Naumov, John Kim, Dheevatsa Mudigere, Srinivas Sridharan, Xiaodong Wang, Whitney Zhao, Serhat Yilmaz, Changkyu Kim, Hector Yuen, Mustafa Ozdal, Krishnakumar Nair, Isabel Gao, Bor-Yiing Su, Jiyan Yang, Mikhail Smelyanskiy

Large-scale training is important to ensure high performance and accuracy of machine-learning models.


Automatic Model Parallelism for Deep Neural Networks with Compiler and Hardware Support

no code implementations • 11 Jun 2019 • Sanket Tavarageri, Srinivas Sridharan, Bharat Kaul

We model the computation and communication costs of a dataflow graph that embodies the neural network training process, and then partition the graph using heuristics so that communication between compute devices is minimized and the load is well balanced.
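
A minimal greedy placement heuristic in that spirit is sketched below: each operator is assigned to the device that keeps compute load balanced while penalizing dataflow edges that would cross devices. It is an illustrative stand-in, not the partitioning pass from the paper, and the operator names and costs are made up for the example.

```python
# Greedy dataflow-graph partitioning: balance compute load while penalizing
# cross-device communication. Illustrative sketch only.

def partition(ops, edges, n_devices, comm_penalty=1.0):
    """ops: {op: compute_cost}; edges: {(src, dst): tensor_bytes}."""
    placement, load = {}, [0.0] * n_devices
    for op, cost in sorted(ops.items(), key=lambda kv: -kv[1]):  # heaviest first
        best_dev, best_score = 0, float("inf")
        for dev in range(n_devices):
            # Bytes on edges touching `op` whose other endpoint is on a different device.
            cut = sum(b for (u, v), b in edges.items()
                      if op in (u, v)
                      and placement.get(v if u == op else u, dev) != dev)
            score = load[dev] + cost + comm_penalty * cut
            if score < best_score:
                best_dev, best_score = dev, score
        placement[op] = best_dev
        load[best_dev] += cost
    return placement

# Tiny assumed example: a four-operator chain split across two devices.
ops = {"conv1": 4.0, "conv2": 4.0, "fc1": 2.0, "fc2": 2.0}
edges = {("conv1", "conv2"): 1.0, ("conv2", "fc1"): 1.0, ("fc1", "fc2"): 1.0}
print(partition(ops, edges, n_devices=2))
```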


On Scale-out Deep Learning Training for Cloud and HPC

no code implementations • 24 Jan 2018 • Srinivas Sridharan, Karthikeyan Vaidyanathan, Dhiraj Kalamkar, Dipankar Das, Mikhail E. Smorkalov, Mikhail Shiryaev, Dheevatsa Mudigere, Naveen Mellempudi, Sasikanth Avancha, Bharat Kaul, Pradeep Dubey

The exponential growth in use of large deep neural networks has accelerated the need for training these deep neural networks in hours or even minutes.


Distributed Deep Learning Using Synchronous Stochastic Gradient Descent

no code implementations • 22 Feb 2016 • Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, Pradeep Dubey

We design and implement a distributed multinode synchronous SGD algorithm without altering hyperparameters, compressing data, or changing algorithmic behavior.
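
A minimal sketch of one synchronous-SGD step with gradient averaging is shown below. It uses PyTorch's standard distributed all-reduce as a stand-in, so treat it as an illustration of the approach rather than the authors' implementation; the process group is assumed to be initialized elsewhere.

```python
# Synchronous SGD step: average gradients across all workers, then apply a
# plain SGD update with unchanged hyperparameters and no gradient compression.
import torch
import torch.distributed as dist

def sync_sgd_step(model, loss, lr=0.01):
    loss.backward()                                   # compute local gradients
    world = dist.get_world_size()
    for p in model.parameters():                      # average gradients across workers
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world
    with torch.no_grad():                             # vanilla SGD update
        for p in model.parameters():
            p -= lr * p.grad
            p.grad.zero_()
```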
